
TensorFlow tf.matmul ends up using CPU backend for 32bit floats #14120

Closed
Micket opened this issue Oct 11, 2021 · 4 comments · Fixed by easybuilders/easybuild-easyblocks#2583


Micket commented Oct 11, 2021

This was brought up on Slack, but it seems our TF CUDA builds end up using _MklMatMul on the CPU when using 32-bit floats.

import tensorflow as tf
tf.debugging.set_log_device_placement(True)
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 2.0], [5.0, 6.0]])
c = tf.matmul(a, b)  # Calls _MklMatMul on the CPU
d = a + b  # Other operations correctly use the GPU.

In my testing, versions 2.2.0, 2.3.1, 2.4.1, and 2.5.0 all have this problem; example output looks like

>>> c = tf.matmul(a,b)
2021-10-11 16:40:10.042624: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
>>> d = a + b # Other operations correctly uses the GPU.
2021-10-11 16:40:10.200334: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0
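Placement lines like the ones above can also be checked mechanically rather than by eye. A minimal sketch, assuming the log format shown in this issue (the regex and helper name are my own, not part of TensorFlow):

```python
import re

# Matches lines like:
#   ... Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device \S*/device:(CPU|GPU):\d+")

def op_placements(log_lines):
    """Return (op_name, device_type) pairs parsed from device-placement log lines."""
    out = []
    for line in log_lines:
        m = PLACEMENT_RE.search(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

log = [
    "2021-10-11 16:40:10.042624: I tensorflow/core/common_runtime/eager/execute.cc:733] "
    "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0",
    "2021-10-11 16:40:10.200334: I tensorflow/core/common_runtime/eager/execute.cc:733] "
    "Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0",
]
print(op_placements(log))  # [('_MklMatMul', 'CPU'), ('AddV2', 'GPU')]
```

This makes it easy to scan a full run's output for any op that unexpectedly landed on the CPU.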

TensorFlow 2.6.0 seems to work correctly (though I have not extensively tested all types of operations).
Using 16-bit or 64-bit floats, they all use the GPU:

a = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float64)
b = tf.constant([[1.0, 2.0], [5.0, 6.0]], dtype=tf.float64)
c = tf.matmul(a,b) # Calls MatMul on the GPU

Containers with TF don't seem to have this issue, so it's something specific to our builds. Perhaps the MKL stuff should be disabled somehow for CUDA builds?

@Micket Micket added this to the next release (4.5.0?) milestone Oct 11, 2021

Micket commented Oct 11, 2021

Just to clarify: these example matrices are tiny, but there doesn't seem to be any case where it dispatches based on size.

import numpy as np

a = tf.constant(np.random.rand(10000, 10000), dtype=tf.float32)
b = tf.constant(np.random.rand(10000, 10000), dtype=tf.float32)
c = tf.matmul(a, b)

still uses the CPU

@VRehnberg

On containers installed through pytorch/pytorch on Docker Hub or NVIDIA NGC, the output for 16-, 32-, and 64-bit floats always seems to be

2021-10-11 16:55:43.695195: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-11 16:55:43.697281: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-11 16:55:43.714457: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0

However, installing TF 2.5.0 with conda through an overlay leads to:

2021-10-11 16:57:12.694294: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-11 16:57:12.700884: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-11 16:57:12.748378: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0

@akesandgren

This seems to be solved by easybuilders/easybuild-easyblocks#2583 according to initial testing, at least for TF 2.4.1.

boegel commented Oct 13, 2021

It seems there's a runtime switch, $TF_DISABLE_MKL, to avoid the use of _MklMatMul on the CPU (found via tensorflow/tensorflow#33146).

# matmul.py corresponds to the code in first code block in the issue description

$ module load TensorFlow/2.5.0-fosscuda-2020b

$ python matmul.py
...
2021-10-13 18:37:14.783580: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-13 18:37:15.325011: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0

$ TF_DISABLE_MKL=1 python matmul.py
...
2021-10-13 18:37:59.804746: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
...
2021-10-13 18:38:04.452648: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0
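Besides setting the variable on the command line as above, it can also be set from inside a script. A minimal sketch, under the assumption that TF_DISABLE_MKL is only consulted when tensorflow is first imported, so it must be set beforehand:

```python
import os

# Assumption: TF_DISABLE_MKL is read when tensorflow is first imported,
# so set it before the import happens.
os.environ["TF_DISABLE_MKL"] = "1"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_DISABLE_MKL"])  # prints "1"
```

Setting it in the module file for affected TensorFlow installations would make the workaround transparent to users.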

With TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1 (which was configured with --config=mkl), I can confirm that MatMul always runs on the GPU:

$ module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
$ python matmul.py
...
2021-10-13 18:41:43.324666: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-13 18:41:48.961807: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2021-10-13 18:41:49.227718: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0

