
ROCm-3.7+ broken on gfx803 #1265

Closed
xuhuisheng opened this issue Oct 23, 2020 · 13 comments

@xuhuisheng
Contributor

xuhuisheng commented Oct 23, 2020

This issue tracks problems with ROCm-3.7+ on gfx803 - e.g. RX 470, RX 570, RX 580.

If you have a gfx803 card and installed ROCm-3.7, ROCm-3.8, ROCm-3.9, or ROCm-4.0, you may run into symptoms like:

  • Invalid argument: indices[5,284] = 997212422 is not in [0, 5001)
  • Low accuracy with the loss going to NaN

Related issues:

My advice is:

  • Downgrade to ROCm-3.5.1, e.g. from the Ubuntu repo: http://repo.radeon.com/rocm/apt/3.5.1
  • OR buy a gfx900 (Vega 56/64) or gfx906 (Radeon VII) card to test
  • OR build rocBLAS with BUILD_WITH_TENSILE_HOST=false yourself (the build needs a lot of memory; AMD suggests at least 64 GB) - see the sketch below
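A minimal sketch of that rocBLAS rebuild, assuming a ROCm install under /opt/rocm; the branch name and exact CMake flags are assumptions, so check the rocBLAS build docs for your release:

```
# Sketch only: rebuild rocBLAS with the legacy Tensile client for gfx803.
git clone -b rocm-3.8.x https://github.com/ROCmSoftwarePlatform/rocBLAS.git  # branch name assumed
cd rocBLAS && mkdir build && cd build

# BUILD_WITH_TENSILE_HOST=OFF falls back to the old Tensile client;
# AMDGPU_TARGETS limits the built code objects to gfx803 (Polaris).
CXX=/opt/rocm/bin/hipcc cmake \
    -DBUILD_WITH_TENSILE_HOST=OFF \
    -DAMDGPU_TARGETS=gfx803 \
    -DCMAKE_INSTALL_PREFIX=/opt/rocm ..

# AMD suggests at least 64 GB of RAM for this step.
make -j"$(nproc)" && sudo make install
```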

Investigation info:

Building rocBLAS with BUILD_WITH_TENSILE_HOST=false solves this issue.
Starting with ROCm-3.7, rocBLAS switched BUILD_WITH_TENSILE_HOST from false to true, which is why ROCm-3.5.1 still works properly.
It seems the new Tensile host client that rocBLAS now uses does not support gfx803.

I created an issue against rocBLAS: ROCm/rocBLAS#1172

@ghost

ghost commented Oct 27, 2020

Hi @xuhuisheng,

Thank you for bringing this up. Let me look into it. Are you using Fiji?

@xuhuisheng
Contributor Author

xuhuisheng commented Oct 27, 2020

@ashutoshamd Thank you for the reply.
gfx803 has been unable to run tensorflow-rocm or PyTorch for a long time now, ever since ROCm-3.7 (released 2020-08-21).

The official TensorFlow tutorial reproduces this issue almost 90% of the time: https://www.tensorflow.org/tutorials/keras/text_classification
Steps to reproduce (see the sketch after the list):

  1. Run ubuntu:20.04 using Docker
  2. Install ROCm-3.8
  3. Install tensorflow-rocm 2.3.1
  4. Download text_classification and run it
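A hedged sketch of those steps; the repo line and package names follow the ROCm-3.x apt conventions of that era and are assumptions to verify against the install docs:

```
# Sketch: reproduce in an ubuntu:20.04 container with the GPU passed through.
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04

# Inside the container: add the ROCm-3.8 apt repo and install the user-space stack.
apt update && apt install -y wget gnupg2 python3-pip
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.8/ xenial main' \
    > /etc/apt/sources.list.d/rocm.list
apt update && apt install -y rocm-dev rocm-libs

# Install the matching TensorFlow build and run the tutorial script.
pip3 install tensorflow-rocm==2.3.1
python3 text_classification.py  # script name assumed; it is the downloaded tutorial
```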

My environment:

OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10) CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1

@AsimPoptani

How does one downgrade?

@rkothako

Hi @AsimPoptani
There is no in-place downgrade from a specific version.
You can uninstall the existing version of ROCm and then install a specific ROCm version from a repo like http://repo.radeon.com/rocm/apt/3.x/,
where x is 9, 8, or any other number. A sketch follows below.
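A minimal sketch of that flow on Ubuntu; the package name rocm-dkms is an assumption based on the usual ROCm packaging:

```
# Sketch: remove the current ROCm packages, then pin the apt repo to 3.5.1.
sudo apt autoremove rocm-dkms rocm-dev rocm-libs

echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | \
    sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dkms
```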

@jpsollie

jpsollie commented Dec 12, 2020

I'm encountering the same issue here: benchmarking ROCm 3.9 and 3.10 on a system with 2x R9 Nano GPUs is more than 10x slower on ROCm than on Clover (if it runs at all):
`echo $(date +%%%s.%N%% && ./a.out && date +%%%s.%N%%) >> rocm.txt`
gives (executed 3 times):
```
%1607587105.151533938% Result: 11168608085589920491 Runtime: 0.012519ms %1607587130.296327194%
%1607670831.441274542% Result: 11168608085589920491 Runtime: 0.013072ms %1607670855.944835450%
%1607670999.627702896% Result: 11168608085589920491 Runtime: 0.012555ms %1607671024.114541166%
```
while on Clover it becomes:
```
%1607525532.830965431% Result: 11168608085589920491 Runtime: 0.000692ms %1607525557.858665546%
%1607525898.437019510% Result: 11168608085589920491 Runtime: 0.001562ms %1607525923.446692449%
%1607525926.138453752% Result: 11168608085589920491 Runtime: 0.000700ms %1607525950.744559860%
```
The benchmark was a slightly modified version of opencl-benchmark here on GitHub, adjusted to verify its results while computing, with the workload increased for more accurate timings. If you need it, let me know.

@ROCmSupport

Hi @jpsollie
Looks like it's a different issue; I recommend filing a new ticket.
Thank you.

@ROCmSupport

ROCmSupport commented Jan 4, 2021

AMD officially dropped gfx8 support as of ROCm 4.0, per https://github.com/RadeonOpenCompute/ROCm#hardware-and-software-support. But some things might still work.
Hence, closing this issue.
Thank you.

@AsimPoptani

@ROCmSupport this seems a bit ridiculous, as NVIDIA still supports the 10-series graphics cards, which are just as old...

@AsimPoptani

Also @ROCmSupport, the link is broken.

@ROCmSupport

ROCmSupport commented Jan 4, 2021

It is https://github.com/RadeonOpenCompute/ROCm#hardware-and-software-support.
Clicking the link fails to open it. Please copy and paste the link into a browser and press Enter.

@boriswinner

Here is my guide for downgrading to ROCm 3.5.1 + TensorFlow 2.2:
https://github.com/boriswinner/RX580-rocM-tensorflow-ubuntu20.4-guide

@gabrielziegler3

Has anyone on Arch uploaded a working ROCm build yet? I've been struggling with this for a while now.

@9Tito

9Tito commented Jun 17, 2021

I resolved this problem using this: https://githubmemory.com/repo/xuhuisheng/rocm-gfx803

but I had to add this: tf.compat.v1.disable_eager_execution()

HOWEVER, my routine without the graphics card (on an AMD® Ryzen 5 2600 six-core processor × 12 threads) is 2 times faster than with the graphics card :(. Is that normal? Is it because this is not the best graphics card for ML?

PS: I did not downgrade my ROCm; I don't know how, or how to check which version I have.
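For checking the installed ROCm version, two common places to look on Ubuntu (paths assumed from the standard ROCm layout):

```
# Version file shipped inside the ROCm install tree.
cat /opt/rocm/.info/version

# Or list the installed ROCm packages through the package manager.
dpkg -l | grep -i rocm
```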
