Very slow computation (with lots of drm:invalidate_tlbs errors in dmesg)

### System information
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**:

- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**:
Ubuntu 18.04.1 (Kernel version: 4.15.0-35-generic #38-Ubuntu SMP)
ROCm version: 1.9.211
rocm-opencl: 1.2.0-2018090737

- **Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device**:
N/A

- **TensorFlow installed from (source or binary)**:
http://repo.radeon.com/rocm/misc/tensorflow/tensorflow_rocm-1.8.0-cp35-cp35m-manylinux1_x86_64.whl

- **Python version**:
Python 3.5.2

- **GPU model and memory**:
3x Vega Frontier Edition (connected via PCIe riser) + onboard VGA
Motherboard: Supermicro X10SRL-F
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

- **Exact command to reproduce**:
$ export HIP_VISIBLE_DEVICES=0
$ python train.py  # train.py is included below

### Describe the problem

After setting HIP_VISIBLE_DEVICES=0, I was able to run the Python script included below. However, it was extremely slow during the model.fit call, apparently blocking from time to time because of a driver issue.

Looking at dmesg output, I found many repeated error messages:
"[drm:invalidate_tlbs [amdgpu]] *ERROR* wait for kiq fence error: 0."
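
To quantify how often the error occurs, a small script along these lines could count matching lines in saved dmesg output (for example, after `dmesg > dmesg.txt`). This is only a sketch: the function name `count_kiq_errors` and the sample timestamps are hypothetical, and the pattern is taken from the message quoted above.

```python
import re

# Matches the error text quoted above; brackets and asterisks are escaped
# so they are treated literally.
KIQ_ERROR = re.compile(
    r"\[drm:invalidate_tlbs \[amdgpu\]\] \*ERROR\* wait for kiq fence error"
)


def count_kiq_errors(dmesg_text):
    """Return the number of lines in dmesg_text that contain the kiq fence error."""
    return sum(1 for line in dmesg_text.splitlines() if KIQ_ERROR.search(line))


# Hypothetical sample: the timestamps and the unrelated line are made up.
sample = """\
[  100.000000] [drm:invalidate_tlbs [amdgpu]] *ERROR* wait for kiq fence error: 0.
[  100.500000] some unrelated kernel message
[  101.000000] [drm:invalidate_tlbs [amdgpu]] *ERROR* wait for kiq fence error: 0.
"""

print(count_kiq_errors(sample))
```

Comparing the count before and after a training run would show whether the errors line up with the stalls.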

Interestingly, I also tried running clinfo, which appeared to unblock the Python script and let it proceed quickly. Unfortunately, the script then crashed with a memory access fault and a core dump. Here is the relevant output from the script:

>>> res = model.fit(x_train, y_train, epochs=30)
Epoch 1/30
60000/60000 [==============================] - 358s 6ms/step - loss: 0.0353 - acc: 0.9885
Epoch 2/30
60000/60000 [==============================] - 6s 105us/step - loss: 0.0322 - acc: 0.9892
Epoch 3/30
60000/60000 [==============================] - 12s 202us/step - loss: 0.0279 - acc: 0.9911
Epoch 4/30
49152/60000 [=======================>......] - ETA: 1s - loss: 0.0235 - acc: 0.9925Memory access fault by GPU node-1 (Agent handle: 0x55e318bc22c0) on address (nil). Reason: Page not present or supervisor privilege.
Aborted (core dumped)
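
The per-epoch wall-clock times can be pulled out of the output above to make the slowdown explicit. The following is a minimal sketch, assuming the Keras progress-bar format shown in this log; the helper name `epoch_seconds` is hypothetical.

```python
import re

# A completed epoch line ends its progress bar with "] - <N>s <rate>/step".
# In-progress lines show "- ETA: ..." instead and are deliberately not matched.
EPOCH_LINE = re.compile(r"\] - (\d+)s \d+\w+/step")


def epoch_seconds(log_text):
    """Return the wall-clock seconds reported for each completed epoch."""
    return [int(m.group(1)) for m in EPOCH_LINE.finditer(log_text)]


# Lines copied from the output above.
log = """\
Epoch 1/30
60000/60000 [==============================] - 358s 6ms/step - loss: 0.0353 - acc: 0.9885
Epoch 2/30
60000/60000 [==============================] - 6s 105us/step - loss: 0.0322 - acc: 0.9892
Epoch 3/30
60000/60000 [==============================] - 12s 202us/step - loss: 0.0279 - acc: 0.9911
Epoch 4/30
49152/60000 [=======================>......] - ETA: 1s - loss: 0.0235 - acc: 0.9925
"""

times = epoch_seconds(log)
print(times)
```

For this log the completed epochs took 358, 6, and 12 seconds, so the first epoch was roughly 60 times slower than the second.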


### Source code / logs

Contents of train.py:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

res = model.fit(x_train, y_train, epochs=30)
model.evaluate(x_test, y_test)

