By default, cuBLAS (at least as of version 10.0) does not appear to use Tensor Cores.
To turn them on, the user must explicitly set the Tensor Core (TC) math mode:
https://docs.nvidia.com/cuda/cublas/index.html#cublassetmathmode
"The cublasSetMathMode function enables you to choose whether or not to use Tensor Core operations in the library by setting the math mode to either CUBLAS_TENSOR_OP_MATH or CUBLAS_DEFAULT_MATH. Tensor Core operations perform parallel floating point accumulation of multiple floating point products. Setting the math mode to CUBLAS_TENSOR_OP_MATH indicates that the library will use Tensor Core operations in the functions: cublasHgemm(), cublasGemmEx, cublasSgemmEx(), cublasHgemmBatched() and cublasHgemmStridedBatched(). The math mode default is CUBLAS_DEFAULT_MATH, this default indicates that the Tensor Core operations will be avoided by the library. "
To enable TC:
cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
As of today, afcuda does not set the cuBLAS math mode, so the default, CUBLAS_DEFAULT_MATH, is used.
As a consequence, afcuda currently does not take advantage of Tensor Cores, which makes f16 hardly worth using and also rules out fast, reduced-precision f32 GEMM.
The goal of this issue is to let users of af/afcuda set the cuBLAS math mode. Several options:
1 an environment variable, for example AF_CUDA_MATH_MODE or even AF_MATH_MODE
2 a generic C gemm API, for example af_gemm_mode(AF_IEEE or AF_LOW_PRECISION ...)
3 a generic set-backend-option API, for example af_set_backend_option(backend, key, value), called as af_set_backend_option(AFCUDA, "CUBLAS_MATH_MODE", "CUBLAS_TENSOR_OP_MATH")
4 any other idea
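To make option 1 concrete, here is a minimal sketch of how the environment variable could be read. Everything in it is an assumption for illustration: the accepted values, the helper names parseMathMode and mathModeFromEnv, and the MathMode enum, which is only a stand-in for cublasMath_t so that the snippet compiles without the CUDA toolkit. It is not existing ArrayFire code.

```cpp
#include <cstdlib>
#include <string>

// Stand-in for cublasMath_t (assumption, so the sketch builds without CUDA).
// In afcuda these would map to CUBLAS_DEFAULT_MATH and CUBLAS_TENSOR_OP_MATH.
enum class MathMode { Default, TensorOp };

// Hypothetical helper: map the text of the variable to a math mode.
// An unset or unrecognized value falls back to the cuBLAS default.
MathMode parseMathMode(const char* value) {
    if (value == nullptr) return MathMode::Default;
    const std::string v(value);
    if (v == "CUBLAS_TENSOR_OP_MATH") return MathMode::TensorOp;
    return MathMode::Default;
}

// Hypothetical helper: read AF_CUDA_MATH_MODE, the name proposed in option 1.
MathMode mathModeFromEnv() {
    return parseMathMode(std::getenv("AF_CUDA_MATH_MODE"));
}
```

With something like this, afcuda could call cublasSetMathMode(handle, ...) once, right after creating the cuBLAS handle, passing the cuBLAS constant that corresponds to the parsed value. Falling back to the default on unknown input keeps current behavior unchanged for users who do not set the variable.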
I would prefer option 1, since it does not modify the AF API, at least until we find a generic API that is valid for all backends.
Option 3 would be acceptable.
Kind
WT.
RFC:
@umar456 @9prady9 @syurkevi @mark-poscablo @gonzaloiglesiasiglesias
Refs: TC math now seems to be enabled by default in TensorFlow:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_blas.cc
https://fossies.org/linux/tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc
// Decide whether to enable TENSOR_OP_MATH
static bool TensorOpMathEnabled() {
  static bool is_enabled = [] {
    bool is_disabled;
    TF_CHECK_OK(
        tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUBLAS_TENSOR_OP_MATH",
                                       /*default_val=*/false, &is_disabled));
    return !is_disabled;
  }();
  return is_enabled;
}