
ROCm-3.7+ broken on gfx803 #1265

Closed
xuhuisheng opened this issue Oct 23, 2020 · 13 comments

@xuhuisheng
Contributor

xuhuisheng commented Oct 23, 2020

This issue tracks problems with ROCm-3.7+ on gfx803 - e.g. RX 470, RX 570, RX 580.

If you have a gfx803 card and installed ROCm-3.7, ROCm-3.8, ROCm-3.9, or ROCm-4.0, you may run into symptoms like:

  • Invalid argument: indices[5,284] = 997212422 is not in [0, 5001)
  • Low accuracy with the loss going to NaN

Related issues:

My advice is:

  • Downgrade to ROCm-3.5.1, e.g. from the Ubuntu repo: http://repo.radeon.com/rocm/apt/3.5.1
  • OR buy a gfx900 (Vega 56/64) or gfx906 (Radeon VII) card to test
  • OR build rocBLAS with BUILD_WITH_TENSILE_HOST=false yourself (the build needs a lot of memory; AMD suggests at least 64 GB) - see the sketch below
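A minimal sketch of that rocBLAS rebuild, assuming a ROCm install under /opt/rocm; the branch name and exact CMake flags are assumptions, so check the rocBLAS build docs for your release:

```
# Sketch only: rebuild rocBLAS with the legacy Tensile client for gfx803.
git clone -b rocm-3.8.x https://github.com/ROCmSoftwarePlatform/rocBLAS.git  # branch name assumed
cd rocBLAS && mkdir build && cd build

# BUILD_WITH_TENSILE_HOST=OFF falls back to the old Tensile client;
# AMDGPU_TARGETS limits the built code objects to gfx803 (Polaris).
CXX=/opt/rocm/bin/hipcc cmake \
    -DBUILD_WITH_TENSILE_HOST=OFF \
    -DAMDGPU_TARGETS=gfx803 \
    -DCMAKE_INSTALL_PREFIX=/opt/rocm ..

# AMD suggests at least 64 GB of RAM for this step.
make -j"$(nproc)" && sudo make install
```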

Investigation info:

Building rocBLAS with BUILD_WITH_TENSILE_HOST=false solves this issue.
Starting with ROCm-3.7, rocBLAS switched BUILD_WITH_TENSILE_HOST from false to true, which is why ROCm-3.5.1 still works properly.
It seems the new Tensile host client that rocBLAS now uses does not support gfx803.

I created an issue against rocBLAS: ROCm/rocBLAS#1172

@ghost

ghost commented Oct 27, 2020

Hi @xuhuisheng,

Thank you for bringing this up. Let me look into it. Are you using Fiji?

@xuhuisheng
Contributor Author

xuhuisheng commented Oct 27, 2020

@ashutoshamd Thank you for the reply.
gfx803 has been unable to run tensorflow-rocm or PyTorch for a long time now, ever since ROCm-3.7 (released 2020-08-21).

The official TensorFlow tutorial reproduces this issue almost 90% of the time: https://www.tensorflow.org/tutorials/keras/text_classification
Steps to reproduce (see the sketch after the list):

  1. Run ubuntu:20.04 using Docker
  2. Install ROCm-3.8
  3. Install tensorflow-rocm 2.3.1
  4. Download text_classification and run it
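A hedged sketch of those steps; the repo line and package names follow the ROCm-3.x apt conventions of that era and are assumptions to verify against the install docs:

```
# Sketch: reproduce in an ubuntu:20.04 container with the GPU passed through.
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04

# Inside the container: add the ROCm-3.8 apt repo and install the user-space stack.
apt update && apt install -y wget gnupg2 python3-pip
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.8/ xenial main' \
    > /etc/apt/sources.list.d/rocm.list
apt update && apt install -y rocm-dev rocm-libs

# Install the matching TensorFlow build and run the tutorial script.
pip3 install tensorflow-rocm==2.3.1
python3 text_classification.py  # script name assumed; it is the downloaded tutorial
```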

My environment:

OS: Ubuntu-20.04
CPU: Xeon 2620v3
GPU: RX580 8G (Polaris10) CHIP ID: 0x67df
Python: 3.8.5
Tensorflow-rocm: 2.3.1

@AsimPoptani

How does one downgrade?

@rkothako

Hi @AsimPoptani
There is no in-place downgrade from a specific version.
You can uninstall the existing version of ROCm and then install a specific ROCm version from a repo like http://repo.radeon.com/rocm/apt/3.x/,
where x is 9, 8, or any other number. A sketch follows below.
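A minimal sketch of that flow on Ubuntu; the package name rocm-dkms is an assumption based on the usual ROCm packaging:

```
# Sketch: remove the current ROCm packages, then pin the apt repo to 3.5.1.
sudo apt autoremove rocm-dkms rocm-dev rocm-libs

echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/3.5.1/ xenial main' | \
    sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dkms
```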

@jpsollie

jpsollie commented Dec 12, 2020

I'm encountering the same issue here: benchmarking ROCm 3.9 and 3.10 on a system with 2x R9 Nano GPUs is more than 10x slower on ROCm than on Clover (if it runs at all):
`echo $(date +%%%s.%N%% && ./a.out && date +%%%s.%N%%) >> rocm.txt`
gives (executed 3 times):
```
%1607587105.151533938% Result: 11168608085589920491 Runtime: 0.012519ms %1607587130.296327194%
%1607670831.441274542% Result: 11168608085589920491 Runtime: 0.013072ms %1607670855.944835450%
%1607670999.627702896% Result: 11168608085589920491 Runtime: 0.012555ms %1607671024.114541166%
```
while on Clover it becomes:
```
%1607525532.830965431% Result: 11168608085589920491 Runtime: 0.000692ms %1607525557.858665546%
%1607525898.437019510% Result: 11168608085589920491 Runtime: 0.001562ms %1607525923.446692449%
%1607525926.138453752% Result: 11168608085589920491 Runtime: 0.000700ms %1607525950.744559860%
```
The benchmark was a slightly modified version of opencl-benchmark here on GitHub, adjusted to verify its results while computing, with the workload increased for more accurate timings. If you need it, let me know.

@ROCmSupport

Hi @jpsollie
Looks like it's a different issue; I recommend filing a new ticket.
Thank you.

@ROCmSupport

ROCmSupport commented Jan 4, 2021

AMD officially dropped gfx8 support as of ROCm 4.0, per https://github.com/RadeonOpenCompute/ROCm#hardware-and-software-support. But some things might still work.
Hence, closing this issue.
Thank you.

@AsimPoptani

@ROCmSupport this seems a bit ridiculous, as NVIDIA still supports the 10-series graphics cards, which are just as old...

@AsimPoptani

Also @ROCmSupport, the link is broken.

@ROCmSupport

ROCmSupport commented Jan 4, 2021

It is https://github.com/RadeonOpenCompute/ROCm#hardware-and-software-support.
Clicking the link fails to open it. Please copy and paste the link into a browser and press Enter.

@boriswinner

Here is my guide for downgrading to ROCm 3.5.1 + TensorFlow 2.2:
https://github.com/boriswinner/RX580-rocM-tensorflow-ubuntu20.4-guide

@gabrielziegler3

Has anyone on Arch uploaded a working ROCm build yet? I've been struggling with this for a while now.

@9Tito

9Tito commented Jun 17, 2021

I resolved this problem using this: https://githubmemory.com/repo/xuhuisheng/rocm-gfx803

but I had to add this: tf.compat.v1.disable_eager_execution()

HOWEVER, my routine without the graphics card (on an AMD® Ryzen 5 2600 six-core processor × 12 threads) is 2 times faster than with the graphics card :(. Is that normal? Is it because this is not the best graphics card for ML?

PS: I did not downgrade my ROCm; I don't know how, or how to check which version I have.
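For checking the installed ROCm version, two common places to look on Ubuntu (paths assumed from the standard ROCm layout):

```
# Version file shipped inside the ROCm install tree.
cat /opt/rocm/.info/version

# Or list the installed ROCm packages through the package manager.
dpkg -l | grep -i rocm
```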
