docker pull the latest version of the rocm/tensorflow image, use python to run the built-in /tf_cnn_benchmarks.py to perform inference testing on MI210 #2335
Comments
The root cause of the first issue is that the tf_cnn_benchmarks checkout in rocm/tensorflow:latest (ROCm 5.7.1) might not be at the latest commit. You can update it to the latest commit with "git pull"; see if that helps. For the second issue, I suspect it is due to an old driver. Could you run "rocm-smi" on the host and post the driver version?
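For reference, a minimal sketch of the two suggestions above, assuming the benchmarks checkout lives under /benchmarks inside the container (the path mentioned later in this issue):

```bash
# Inside the running rocm/tensorflow container: update the benchmarks checkout
cd /benchmarks
git pull                         # fetch the latest tf_cnn_benchmarks commit

# On the host: report the installed driver version
rocm-smi --showdriverversion
```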
Thank you very much for your help. For the first issue, I'll try that. For the second, I checked the driver version using the rocm-smi --showdriverversion command; it reports 5.15.0-91-generic.
Hi, the reported driver version is consistent with my suspicion. We got the same error with driver version 5.13.20.22.10 and didn't have any issues with driver version 6.2.4 or 6.3.4. Could you try upgrading your driver to see if it helps?
OK, thanks, I'll try. I'll let you know once I've updated the driver and tested it.
Is the version you want the one shown by rocm-smi --showdriverversion? On my machine, the output of rocm-smi --showdriverversion is the same as uname -r. Does this require an OS with a 6.2 kernel?
Yes, it is the version shown by rocm-smi --showdriverversion.
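As a side note, a small sketch of commands that can help tell the kernel release apart from the out-of-tree amdgpu (DKMS) driver version; the sysfs file below only exists when the DKMS module is loaded, so its absence usually means the in-tree kernel driver is in use:

```bash
uname -r                         # kernel release, e.g. 5.15.0-91-generic
cat /sys/module/amdgpu/version   # out-of-tree (DKMS) amdgpu driver version, if loaded
dkms status | grep amdgpu        # installed amdgpu DKMS package, if any
```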
https://rocm.docs.amd.com/en/docs-5.7.1/deploy/linux/os-native/upgrade.html
Hello, I need to change version=5.7 in the figure below to 6.2.4, right?
No. 5.7 is the ROCm version, which is different from the amdgpu driver version. The page shows how to upgrade the amdgpu kernel driver (to the latest) while keeping ROCm 5.7.
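A rough sketch of the upgrade flow described on that page, assuming Ubuntu 22.04 and the package-manager (apt) install method; the exact repository configuration and package versions should be taken from the linked docs:

```bash
sudo apt update
sudo apt install amdgpu-dkms     # installs/updates the out-of-tree amdgpu kernel driver
sudo reboot                      # load the new driver
rocm-smi --showdriverversion     # verify the reported driver version
```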
@sunway513 Do you know if we have any public benchmarks for training and inference on MI200? |
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
latest
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04.3
Mobile device
Linux Ubuntu 22.04.3
Python version
3.9
Bazel version
No response
GCC/compiler version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CUDA/cuDNN version
No response
GPU model and memory
MI210
Current behavior?
I'm using Ubuntu 22.04.3 with ROCm 5.7.1.
I want to run python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 as an inference test on the MI210. In addition, could you please provide a training and testing method corresponding to NVIDIA's ./bencher.sh 0,1,2,3 tool?
For inference testing, I pulled rocm/tensorflow:latest with docker. Running tf_cnn_benchmarks.py with Python under /benchmarks/scripts/tf_cnn_benchmarks runs into two problems:
Running python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 fails with a keras or keras.api problem when the --model=resnet50 parameter is used: the model cannot be found.
The second problem is that during the inference test I get tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0.
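For completeness, a sketch of the reproduction steps described above; the --device and --group-add flags are the usual options for exposing AMD GPUs to a container and may need adjusting for your setup:

```bash
docker pull rocm/tensorflow:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    rocm/tensorflow:latest
cd /benchmarks/scripts/tf_cnn_benchmarks
python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet \
    --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4
```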
Standalone code to reproduce the issue
Relevant log output
No response