Description
Please go to Stack Overflow for help and support:
https://stackoverflow.com/questions/tagged/tensorflow
If you open a GitHub issue, here is our policy:
- It must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead).
- The form below must be filled out.
- It shouldn't be a TensorBoard issue. Those go here.
Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  Ubuntu 18.04.1 (Kernel version: 4.15.0-35-generic #38-Ubuntu SMP)
  rocm version: 1.9.211
  rocm-opencl: 1.2.0-2018090737
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  N/A
- TensorFlow installed from (source or binary):
  http://repo.radeon.com/rocm/misc/tensorflow/tensorflow_rocm-1.8.0-cp35-cp35m-manylinux1_x86_64.whl
- Python version:
  Python 3.5.2
- GPU model and memory:
  3x Vega Frontier Edition (connected via PCIe riser) + onboard VGA
  Motherboard: Supermicro X10SRL-F
  CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
- Exact command to reproduce:
  $ export HIP_VISIBLE_DEVICES=0
  $ python train.py # included below
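The same restriction can be applied from Python when launching the script, which makes it easier to repeat the test with each of the three cards in turn. This is only a sketch: the helper name `env_for_device` is hypothetical, and it assumes the runtime reads `HIP_VISIBLE_DEVICES` from the environment of the process that runs train.py.

```python
import os


def env_for_device(device_index, base_env=None):
    """Return a copy of the environment with HIP_VISIBLE_DEVICES set.

    device_index is the GPU index to expose (0 in the command above).
    base_env defaults to the current process environment and is not
    modified; a new dict is returned instead.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return env
```

The returned dict could then be passed as the `env` argument of `subprocess.run([sys.executable, "train.py"], env=env_for_device(0))`, changing the index to check whether the slowdown follows one particular card.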
Describe the problem
After setting HIP_VISIBLE_DEVICES=0, I was able to run the Python script included below. However, it was extremely slow during the model.fit call, apparently blocking from time to time because of a driver issue.
Looking at dmesg output, I found many repeated error messages:
"[drm:invalidate_tlbs [amdgpu]] ERROR wait for kiq fence error: 0."
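To put a number on "many", a small helper along these lines could count the matching lines in saved dmesg output. It is a rough sketch: the function name `count_kiq_errors` is hypothetical, and the pattern is based only on the message quoted above, so it may need adjusting if the kernel formats the line differently.

```python
import re

# Pattern built from the error message quoted in this report.
KIQ_ERROR = re.compile(
    r"\[drm:invalidate_tlbs \[amdgpu\]\].*wait for kiq fence error"
)


def count_kiq_errors(dmesg_text):
    """Count the lines of dmesg_text that contain the kiq fence error."""
    return sum(
        1 for line in dmesg_text.splitlines() if KIQ_ERROR.search(line)
    )
```

The text can come from a file written with `dmesg > dmesg.txt`. Comparing the count before and after a training run shows whether the errors track the stalls.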
Interestingly, running clinfo while the script was stalled appeared to unblock it, allowing it to proceed quickly. The bad news is that the script then ends in a core dump. Here's the relevant output from the script:
res = model.fit(x_train, y_train, epochs=30)
Epoch 1/30
60000/60000 [==============================] - 358s 6ms/step - loss: 0.0353 - acc: 0.9885
Epoch 2/30
60000/60000 [==============================] - 6s 105us/step - loss: 0.0322 - acc: 0.9892
Epoch 3/30
60000/60000 [==============================] - 12s 202us/step - loss: 0.0279 - acc: 0.9911
Epoch 4/30
49152/60000 [=======================>......] - ETA: 1s - loss: 0.0235 - acc: 0.9925
Memory access fault by GPU node-1 (Agent handle: 0x55e318bc22c0) on address (nil). Reason: Page not present or supervisor privilege.
Aborted (core dumped)
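The per-epoch times in this log make the stall easy to see: the first epoch took 358s, while the second took 6s, roughly 60 times faster. A sketch like the following could pull those times out of a longer log. The name `epoch_seconds` is hypothetical, and the pattern assumes the Keras progress-bar format shown above, with times printed in whole seconds.

```python
import re

# Matches a finished-epoch line such as
# "60000/60000 [=====] - 358s 6ms/step - loss: ...".
# Partial lines (with ">" and "." in the bar) do not match.
EPOCH_LINE = re.compile(r"^(\d+)/(\d+) \[=+\] - (\d+)s ")


def epoch_seconds(log_text):
    """Return the wall-clock seconds of each completed epoch, in order."""
    times = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        # Only keep lines where all samples were processed.
        if match and match.group(1) == match.group(2):
            times.append(int(match.group(3)))
    return times
```

Applied to the output above, the interrupted fourth epoch is skipped because its progress bar is incomplete.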
Source code / logs
content of train.py:
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

res = model.fit(x_train, y_train, epochs=30)
model.evaluate(x_test, y_test)