Tensorflow 2.0 AMD support #362
Hi @Cvikli , we are finalizing the 2.0-alpha docker image and it will be available soon, please stay tuned.
Hi @Cvikli , we've pushed out the preview build docker image for TF2.0-alpha0:
Great! Thank you for the fast work! I am really excited about it!
Please open a new issue if bugs are found with the 2.0 docker.
Sorry for reopening the thread, but I owe you guys a lot! The Radeon VII's performance is crazy with TensorFlow 2.0a. We are glad we opened our eyes to AMD products; we are buying our first configuration, which is 40% cheaper and, as we measured, able to perform better in our scenario than our well-optimised server configuration. Thank you for all the work!
Could you give a bit more detail? How much faster is the Radeon VII for your application? What type of model are you running (CNN/RNN/GAN/etc.)? What processor are you using? Just curious.
Thank you @Cvikli , great to hear that your experiment went well and you are going to invest more in ROCm and AMD GPUs!
The system is something like this:
The result with RNN networks on 1 Radeon VII and a 1080 Ti was close to the same. Now, after we switched over to 4 Radeon VIIs, we face two big scaling issues on convolutional networks.
We are pretty sure things should work, because everything worked with the NVIDIA 1080 Ti. However, even though it reports that it failed to allocate the memory, the whole program still starts and, as far as I can tell, runs normally. Could this happen because of the docker image, meaning we can't use separate GPUs for different runs?
What do you guys think about this? Is it normal that we get 10x slower speed when it comes to cuDNN? (To me, cuDNN sounds like pure software with better arithmetic operations, I guess; is it possible to improve on this?)
Hi @Cvikli , let's step back a bit and look at your system configuration:
A typical gold-rated workstation power supply runs at about 87% efficiency at full load, so it can supposedly deliver up to 1307W.
The above error message indicates the target GPU device memory has already been allocated by the other processes.
We recommend approach #3, as that would isolate the GPUs at a relatively lower level of the ROCm stack. For your concern on mGPU performance, could you provide the exact commands to reproduce your observations? Just FYI, we have been actively running regression tests for single-node multi-GPU performance, and there's no mGPU performance regression reported for TF1.13 on the ROCm2.4 release.
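As a minimal sketch of the ROCr-level isolation idea (the helper names here are illustrative, not part of any ROCm API; `ROCR_VISIBLE_DEVICES` is the variable discussed later in this thread), each training process can be launched with its own single-GPU view:

```python
import os
import subprocess

# Illustrative sketch: pin each training process to a single GPU via the
# ROCr-level environment variable, so concurrent runs cannot touch each
# other's device memory. Helper names are made up for this example.

def gpu_env(gpu_index):
    """Build an environment where only GPU `gpu_index` is visible."""
    env = dict(os.environ)
    env["ROCR_VISIBLE_DEVICES"] = str(gpu_index)
    return env

def launch_on_gpu(gpu_index, script):
    """Launch `script` in a subprocess that sees exactly one GPU."""
    return subprocess.Popen(["python3", script], env=gpu_env(gpu_index))

# e.g. four independent runs, one Radeon VII each:
# procs = [launch_on_gpu(i, "train.py") for i in range(4)]
```

The variable must be set before the framework enumerates devices, which is why it goes into the child process environment rather than being set after import.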
Thank you for the three different ways to manage visible devices. We ran some tests on TF2.0 on ROCm2.4, and performance is still a lot lower than what an Nvidia 1080 Ti can provide when benchmarking MobileNetV2, which still bothers us a little.
So I pretty much feel like we are running some operations 19 times, which leads to a 10-15x speed loss, but it is only a guess. If I can help in any other way, let me know. PS: on TF2.0 ROCm2.4, I couldn't run tf_cnn_benchmarks.py because of the missing tensorflow.contrib.
Hi @Cvikli , glad the ROCr env var worked for you!
The above logs indicate the time spent there was actually MIOpen compiling kernels; please refer to my previous comment here for reference.
Besides, if your application is built on the TF1.x API, you may want to use the following TF1.13 release instead of the TF2.0 branch built with --config=v1:
We ported our code from TF2.0 to TF1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on the ROCm2.4 release), and we still see NO improvement in speed.
Hi @Cvikli , could you provide the exact steps to repro your observation? |
Anyone tested with the latest Macbook pros? |
I run into the error "failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)" as above. I have not tried install tensorflow-rocm through docker. Any help? |
Hi @quocdat32461997 , can you try to set the following environment variables: |
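The exact variables suggested are not quoted in this thread, but for this class of `hipError_t(1002)` allocation failure in TF 2.x, one commonly used knob is the allow-growth setting. A hedged sketch, not necessarily the variables referred to above:

```python
import os

# Sketch (assumption: the failure comes from the allocator trying to grab
# nearly the whole 16G upfront). TF_FORCE_GPU_ALLOW_GROWTH asks
# TensorFlow 2.x to grow GPU allocations on demand instead of reserving
# everything at start. It must be set before `import tensorflow as tf`.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# The equivalent in-code option, once TF is imported:
#   for gpu in tf.config.experimental.list_physical_devices("GPU"):
#       tf.config.experimental.set_memory_growth(gpu, True)
```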
Problem solved by re-installing ROCm and tensorflow-rocm. Probably I did not install ROCm properly. Thanks a lot.
Hey there! |
Hi @Cvikli , we are preparing the TF2.0 beta release, it's currently under QA test coverage. |
You guys, you are crazy! I love it! :) Thank you for this speed! |
Looks like the link at the beginning of the thread redirects to https://hub.docker.com, here's the link I'm using to track releases: https://hub.docker.com/r/rocm/tensorflow/tags |
Hi @Cvikli , we have published the docker container for TF-ROCm 2.0 Beta1. Please kindly check it and let us know if you have any questions: |
Hi everyone, I am using a rx 480 with rocm 2.5 and rocm with tensorflow 1.13 works fine. |
Hi @moonshine502 , I've tried a couple of samples using the rocm2.5-tf2.0-beta1-config-v2 docker image on my GFX803 node, those are working fine. |
Hi @sunway513, Hardware: Intel Celeron G3900 (Skylake), AMD Radeon RX 480 (gfx803)
Issue: I am guessing that this error is caused by the cpu not being compatible with the new tensorflow version. Could this be the case? |
@moonshine502 I'm running almost the exact same system setup, and it's able to load and train for me. The only difference appears to be the CPU, or possibly the card. I'm using a Ryzen 5 2400G; everything else looks near the same. I'm using an RX 560 14CU, which registers in Linux as an RX 480 (gfx803), on ROCm 2.5.27. I ran through all the steps for training on the MNIST dataset at the link below to confirm TF2.0 was actually working; the accuracy for the evaluation wasn't the best (~87.7% vs. 98%), but it was able to compute. https://www.tensorflow.org/beta/tutorials/quickstart/beginner Edit: included more info.
Hi @dundir, @sunway513, I am now pretty sure that the cause of the problem is my cpu which does not support avx instructions. It seems that previous versions of tensorflow with rocm were compiled without avx, because they work on my machine. So I may try to build tensorflow 2.0 without avx or get a new cpu. Thank you for your help. |
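The AVX question above can be answered without trial-and-error installs. A small sketch reading `/proc/cpuinfo` on Linux (the function name is illustrative):

```python
# Sketch: check whether the CPU advertises AVX before installing a
# prebuilt tensorflow-rocm wheel compiled with AVX. A wheel built with
# AVX crashes with "Illegal instruction" on CPUs lacking the feature,
# which matches the symptom described above.

def cpu_has_avx(cpuinfo_text):
    """Return True if any 'flags' line lists the avx feature bit."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return " avx " in f" {line.split(':', 1)[1]} "
    return False

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX available:", cpu_has_avx(f.read()))
    except FileNotFoundError:
        print("Not Linux; /proc/cpuinfo unavailable")
```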
Memory being the bottleneck, can we do bfloat16 and int8, float8, float16? Just curious |
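For context on that question, a quick stdlib illustration of the per-element memory footprint of the reduced-precision formats mentioned (sketch only; note "float8" is not a standard TensorFlow dtype):

```python
import struct

# Memory footprint per element for reduced-precision formats. Halving the
# dtype width halves the memory traffic, which is what matters when memory
# bandwidth is the bottleneck.
sizes = {
    "float32": struct.calcsize("f"),  # 4 bytes
    "float16": struct.calcsize("e"),  # 2 bytes (IEEE half)
    "int8":    struct.calcsize("b"),  # 1 byte
}
# bfloat16 is also 2 bytes (same exponent range as float32, fewer mantissa
# bits); Python's struct has no code for it, but frameworks expose it as
# e.g. tf.bfloat16.
print(sizes)
```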
cuDNN is not a purely software play; it is backed by actual silicon (dedicated tensor cores for MAD ops), which boosts half-precision performance. I'll need to check if the Radeon VII has dedicated tensor cores as well. Also, NVIDIA won't automatically optimize code to make use of tensor cores; that has to be done via cuDNN extensions.
@salmanulhaq 1080Ti has no tensor cores. |
Do you have a reference for hardware being involved in cuDNN? cuDNN, AFAIK, is a pure software play with optimizations and whatnot; what you may be referring to is Tensor Cores, which were added on Volta and carried over to Turing silicon.
Has anybody tried TF 2.0 with a Radeon RX 580 with 8GB RAM? Does it work? If it does, has anybody tried running multiple cards in parallel? I have one of the first-generation Nvidia Titan X cards (pre-Pascal). I'm finally giving up on it. It can only run CUDA drivers from the year the card was first produced; with anything newer (I've tried them all), the card won't initialize (i.e., the OS rejects it at the device level). Very sad about this since I paid a ton for it, but it's time to move on.
It ought to work but I'm not convinced that there's a point in running multiple 580s on a single training task. I don't think they'd be fast enough to gain a meaningful speedup (I didn't test rocm, but in a rendering task between a VII and a 580, it was faster to just use the VII than to have them both work together). |
Can anyone reply to @quantuminformation's question, please?
I've now upgraded to the new MBP 16, but not used TFJS for a while, might get into py soon. |
Hi @quantuminformation @kuabhish , please refer to the following doc for ROCm support coverage over OSes: |
Hi Cvikli, I have a Radeon VII but am not able to configure it with TensorFlow. Please guide me; I have been struggling to configure this for more than 15 days. Can I use my GPU without docker? Can I use TensorFlow 1.x with the GPU? I have installed ROCm, but the GPU is still not responding while training my model. My system config:
Hi @sumannelli , did you follow the instructions below to install TF? And certainly, you can use your GPU without docker; that's just a matter of deployment approach. Using docker would likely save you some time configuring the userspace environment for ROCm.
Hi Sunway513,
Hi, Guys |
Hi @Sifatul22 , your configuration should work.
@sunway513 Is Navi now supported? Radeon RX 5500 XT is Navi, isn't it? |
Hi @briansp2020 , Navi is not supported by ROCm yet; please refer to the following document for the list of GPUs supported by ROCm:
Hi sunway513, |
Hi @sumannelli , in the same document, if you follow the steps to install the python3 dependencies, then depending on the default python3 version in your environment, you should be able to configure it correctly.
Hi @sunway513, thanks for the reply. Now I can run TensorFlow 2 on the AMD Radeon VII, but I get: During handling of the above exception, another exception occurred: Traceback (most recent call last): Failed to load the native TensorFlow runtime. See https://www.tensorflow.org/install/errors for some common reasons and solutions. Include the entire stack trace. Thanks
Hi sunway513, sudo apt install rocm-dkms |
Hi @sumannelli-Ib , you can use the following for ROCm2.10 package:
Then
If you want to stick with ROCm3.1, you need to pull the latest tensorflow-rocm whl packages; please consult our document below. In the future, please make sure your tensorflow-rocm version is compatible with the ROCm build installed on your system; the compatibility info:
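That compatibility check is worth scripting before an upgrade. A tiny sketch; the pairings below are ASSUMED examples drawn loosely from this thread, not the authoritative table in the tensorflow-rocm docs:

```python
# Sketch with ASSUMED pairings: the real compatibility table lives in the
# tensorflow-rocm documentation referenced above; this mapping is only an
# illustration of the kind of pre-upgrade check worth automating.
COMPAT = {
    "2.10": "1.15.0",  # example pairing, from this thread
    "3.1":  "1.15.3",  # example pairing, from this thread
}

def required_tf_rocm(rocm_version):
    """Look up the tensorflow-rocm wheel expected for a given ROCm release."""
    try:
        return COMPAT[rocm_version]
    except KeyError:
        raise ValueError(f"No known tensorflow-rocm wheel for ROCm {rocm_version}")

print(required_tf_rocm("3.1"))
```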
Hi @sunway, I tried as mentioned above, but I get the below error when trying to train the model (using the TensorFlow Object Detection API). warning: :0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
Hi @sumannelli-Ib , it seems your system is still on ROCm3.10, is that what you want? |
Hi, dealing with docker is hectic work for me; I have never used it before.
Hi @sumannelli , you are welcome! However, I don't think we have brought up TFOD API for tensorflow-rocm project yet. |
But tensorflow-rocm==1.15.0 is working, whereas tensorflow-rocm==1.15.3 shows the above error.
Hi @sumannelli , can you open a new issue and provide us the steps to reproduce the problem? ROCm enabled the relocatable package feature in ROCm3.1.0; that feature may have introduced the regression you've observed with the TFOD API.
Hi @sunway513, it's been two days since I raised an issue, but nobody has been assigned. Could you please help me with the new link below.
I would be curious if Tensorflow 2.0 works with AMD Radeon VII?
Also, if it is available, are there any benchmark comparisons with the 2080 Ti on some standard network, to see if we should invest in Radeon VII clusters?