Very slow computation (with lots of drm:invalidate_tlbs errors in dmesg) #180

@y-lu

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 18.04.1 (Kernel version: 4.15.0-35-generic #38-Ubuntu SMP)
    rocm version : 1.9.211
    rocm-opencl: 1.2.0-2018090737

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    N/A

  • TensorFlow installed from (source or binary):
    http://repo.radeon.com/rocm/misc/tensorflow/tensorflow_rocm-1.8.0-cp35-cp35m-manylinux1_x86_64.whl

  • Python version:
    Python 3.5.2

  • GPU model and memory:
    3x Vega Frontier Edition (connected via PCIe riser) + onboard VGA
    Motherboard: Supermicro X10SRL-F
    CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

  • Exact command to reproduce:
    $ export HIP_VISIBLE_DEVICES=0
    $ python train.py # included below
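
The device selection in the command above can also be made from inside Python. This is a minimal sketch, assuming the ROCm runtime reads HIP_VISIBLE_DEVICES when it initializes, so the variable has to be set before tensorflow is imported; it was not part of the original reproduction.

```python
import os

# Assumption: the variable is read when the ROCm/HIP runtime starts up,
# so set it before any `import tensorflow` statement runs.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

# `import tensorflow as tf` would follow here in train.py.
```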

Describe the problem

After setting HIP_VISIBLE_DEVICES=0, I was able to run the Python script included below. However, it was extremely slow during the model.fit call, apparently blocking from time to time because of a driver issue.

Looking at dmesg output, I found many repeated error messages:
"[drm:invalidate_tlbs [amdgpu]] ERROR wait for kiq fence error: 0."
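
To put a number on "many", the messages can be counted. The sketch below is hypothetical tooling and not part of the original report: the helper names count_kiq_fence_errors and read_dmesg are made up for illustration, and the pattern is taken from the message quoted above.

```python
import re
import subprocess

# Pattern built from the error message quoted in this report.
KIQ_ERROR = re.compile(r"\[drm:invalidate_tlbs \[amdgpu\]\].*wait for kiq fence error")


def count_kiq_fence_errors(dmesg_text):
    """Return how many lines of dmesg_text contain the kiq fence error."""
    return sum(1 for line in dmesg_text.splitlines() if KIQ_ERROR.search(line))


def read_dmesg():
    """Return the kernel log as text (reading it may require root on some systems)."""
    result = subprocess.run(
        ["dmesg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling count_kiq_fence_errors(read_dmesg()) before and after a training run would show how many of these errors the run produced.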

Interestingly, I also tried running clinfo, which appeared to unblock the Python script and let it proceed quickly. The bad news is that the run then ended in a core dump. Here's the relevant output from the script:

res = model.fit(x_train, y_train, epochs=30)
Epoch 1/30
60000/60000 [==============================] - 358s 6ms/step - loss: 0.0353 - acc: 0.9885
Epoch 2/30
60000/60000 [==============================] - 6s 105us/step - loss: 0.0322 - acc: 0.9892
Epoch 3/30
60000/60000 [==============================] - 12s 202us/step - loss: 0.0279 - acc: 0.9911
Epoch 4/30
49152/60000 [=======================>......] - ETA: 1s - loss: 0.0235 - acc: 0.9925
Memory access fault by GPU node-1 (Agent handle: 0x55e318bc22c0) on address (nil). Reason: Page not present or supervisor privilege.
Aborted (core dumped)
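
The slowdown is easiest to see in the per-epoch wall time. As a small illustration (a hypothetical helper, assuming the progress-line format shown above), the completed-epoch times can be pulled out of the log like this:

```python
import re

# Matches a finished Keras progress line such as
# "60000/60000 [==============================] - 358s 6ms/step - loss: ..."
EPOCH_LINE = re.compile(r"^(\d+)/(\d+) \[=+\] - (\d+)s ")


def epoch_seconds(log_text):
    """Return the wall time in seconds of each completed epoch in log_text."""
    seconds = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        # Only count lines where all samples were processed (e.g. 60000/60000).
        if match and match.group(1) == match.group(2):
            seconds.append(int(match.group(3)))
    return seconds
```

Applied to the output above, this gives 358 s for the first epoch against 6 s and 12 s for the next two; the unfinished fourth epoch is skipped.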

Source code / logs

content of train.py:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Load MNIST and scale pixel values to the range [0, 1].
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected classifier.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# The slowdown and the crash both occur during this call.
res = model.fit(x_train, y_train, epochs=30)
model.evaluate(x_test, y_test)
