
TensorFlow tf.matmul ends up using CPU backend for 32bit floats #14120

Closed
Micket opened this issue Oct 11, 2021 · 4 comments · Fixed by easybuilders/easybuild-easyblocks#2583


Micket commented Oct 11, 2021

This was brought up on Slack, but it seems our TF CUDA builds end up using _MklMatMul on the CPU when using 32-bit floats.

import tensorflow as tf
tf.debugging.set_log_device_placement(True)
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 2.0], [5.0, 6.0]])
c = tf.matmul(a, b)  # Calls _MklMatMul on the CPU
d = a + b  # Other operations correctly use the GPU.

In my testing, versions 2.2.0, 2.3.1, 2.4.1, and 2.5.0 all have this problem; example output looks like

>>> c = tf.matmul(a,b)
2021-10-11 16:40:10.042624: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
>>> d = a + b # Other operations correctly uses the GPU.
2021-10-11 16:40:10.200334: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0
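Placement lines like the ones above can also be checked mechanically rather than by eye. A minimal sketch, assuming the log format shown in this issue (the regex and helper name are my own, not part of TensorFlow):

```python
import re

# Matches lines like:
#   ... Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device \S*/device:(CPU|GPU):\d+")

def op_placements(log_lines):
    """Return (op_name, device_type) pairs parsed from device-placement log lines."""
    out = []
    for line in log_lines:
        m = PLACEMENT_RE.search(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

log = [
    "2021-10-11 16:40:10.042624: I tensorflow/core/common_runtime/eager/execute.cc:733] "
    "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0",
    "2021-10-11 16:40:10.200334: I tensorflow/core/common_runtime/eager/execute.cc:733] "
    "Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0",
]
print(op_placements(log))  # [('_MklMatMul', 'CPU'), ('AddV2', 'GPU')]
```

This makes it easy to scan a full run's output for any op that unexpectedly landed on the CPU.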

TensorFlow 2.6.0 seems to work correctly (though I have not extensively tested all types of operations).
Using 16-bit or 64-bit floats, they all use the GPU:

a = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float64)
b = tf.constant([[1.0, 2.0], [5.0, 6.0]], dtype=tf.float64)
c = tf.matmul(a,b) # Calls MatMul on the GPU

Containers with TF don't seem to have this issue, so it's something specific to our builds. Perhaps the MKL stuff should be disabled somehow for CUDA builds?

@Micket Micket added this to the next release (4.5.0?) milestone Oct 11, 2021

Micket commented Oct 11, 2021

Just to clarify: these example matrices are tiny, but there doesn't seem to be any case where it dispatches based on size.

import numpy as np

a = tf.constant(np.random.rand(10000, 10000), dtype=tf.float32)
b = tf.constant(np.random.rand(10000, 10000), dtype=tf.float32)
c = tf.matmul(a, b)

still uses the CPU

@VRehnberg

On containers installed through pytorch/pytorch on Docker Hub or NVIDIA NGC, the output for 16-, 32-, and 64-bit floats always seems to be

2021-10-11 16:55:43.695195: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-11 16:55:43.697281: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-11 16:55:43.714457: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0

However, installing TF 2.5.0 with conda through an overlay leads to:

2021-10-11 16:57:12.694294: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-11 16:57:12.700884: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-11 16:57:12.748378: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0

@akesandgren

This seems to be solved by easybuilders/easybuild-easyblocks#2583 according to initial testing, at least for TF 2.4.1.

boegel commented Oct 13, 2021

It seems there's a runtime switch, $TF_DISABLE_MKL, to avoid the use of _MklMatMul on the CPU (found via tensorflow/tensorflow#33146).

# matmul.py corresponds to the code in first code block in the issue description

$ module load TensorFlow/2.5.0-fosscuda-2020b

$ python matmul.py
...
2021-10-13 18:37:14.783580: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0
2021-10-13 18:37:15.325011: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0

$ TF_DISABLE_MKL=1 python matmul.py
...
2021-10-13 18:37:59.804746: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
...
2021-10-13 18:38:04.452648: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0
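Besides setting the variable on the command line as above, it can also be set from inside a script. A minimal sketch, under the assumption that TF_DISABLE_MKL is only consulted when tensorflow is first imported, so it must be set beforehand:

```python
import os

# Assumption: TF_DISABLE_MKL is read when tensorflow is first imported,
# so set it before the import happens.
os.environ["TF_DISABLE_MKL"] = "1"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_DISABLE_MKL"])  # prints "1"
```

Setting it in the module file for affected TensorFlow installations would make the workaround transparent to users.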

With TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1 (which was configured with --config=mkl), I can confirm that MatMul always runs on the GPU:

$ module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
$ python matmul.py
...
2021-10-13 18:41:43.324666: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2021-10-13 18:41:48.961807: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2021-10-13 18:41:49.227718: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0

