docker pull the latest version of the rocm/tensorflow image, use python to run the built-in /tf_cnn_benchmarks.py to perform inference testing on MI210 #2335
Comments
The root cause of the first issue is that the tf_cnn_benchmarks checkout in rocm/tensorflow:latest (ROCm 5.7.1) might not be at the latest commit. You can update it to the latest commit with "git pull"; see if that helps. For the second issue, I suspect it is due to an old driver. Could you run "rocm-smi" on the host and post the driver version?
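For reference, a minimal sketch of the two suggestions above, assuming the benchmarks checkout lives under /benchmarks inside the container (the path mentioned later in this issue):

```bash
# Inside the running rocm/tensorflow container: update the benchmarks checkout
cd /benchmarks
git pull                         # fetch the latest tf_cnn_benchmarks commit

# On the host: report the installed driver version
rocm-smi --showdriverversion
```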
Thank you very much for your help. For the first issue, I'll try that. For the second, I checked the driver version using the rocm-smi --showdriverversion command; it reports 5.15.0-91-generic.
Hi, the reported driver version is consistent with my suspicion. We got the same error with driver version 5.13.20.22.10 and didn't have any issues with driver version 6.2.4 or 6.3.4. Could you try upgrading your driver to see if it helps?
OK, thanks, I'll try. I'll let you know once I've updated the driver and tested it.
Is the version you want the one shown by rocm-smi --showdriverversion? On my machine, the output of rocm-smi --showdriverversion is the same as uname -r. Does this require an OS with a 6.2 kernel?
Yes, it is the version shown by rocm-smi --showdriverversion.
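As a side note, a small sketch of commands that can help tell the kernel release apart from the out-of-tree amdgpu (DKMS) driver version; the sysfs file below only exists when the DKMS module is loaded, so its absence usually means the in-tree kernel driver is in use:

```bash
uname -r                         # kernel release, e.g. 5.15.0-91-generic
cat /sys/module/amdgpu/version   # out-of-tree (DKMS) amdgpu driver version, if loaded
dkms status | grep amdgpu        # installed amdgpu DKMS package, if any
```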
https://rocm.docs.amd.com/en/docs-5.7.1/deploy/linux/os-native/upgrade.html
Hello, I need to change version=5.7 in the figure below to 6.2.4, right?
No. 5.7 is the ROCm version, which is different from the amdgpu driver version. The page shows how to upgrade the amdgpu kernel driver (to the latest) while keeping ROCm 5.7.
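A rough sketch of the upgrade flow described on that page, assuming Ubuntu 22.04 and the package-manager (apt) install method; the exact repository configuration and package versions should be taken from the linked docs:

```bash
sudo apt update
sudo apt install amdgpu-dkms     # installs/updates the out-of-tree amdgpu kernel driver
sudo reboot                      # load the new driver
rocm-smi --showdriverversion     # verify the reported driver version
```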
@sunway513 Do you know if we have any public benchmarks for training and inference on MI200? |
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
latest
Custom code
No
OS platform and distribution
Linux Ubuntu 22.04.3
Mobile device
Linux Ubuntu 22.04.3
Python version
3.9
Bazel version
No response
GCC/compiler version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CUDA/cuDNN version
No response
GPU model and memory
MI210
Current behavior?
I'm using Ubuntu 22.04.3 with ROCm 5.7.1.
I want to run python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 as an inference test on the MI210. In addition, could you please provide a training and testing method corresponding to NVIDIA's ./bencher.sh 0,1,2,3 tool?
For inference testing, I pulled rocm/tensorflow:latest with docker. Running tf_cnn_benchmarks.py with Python under /benchmarks/scripts/tf_cnn_benchmarks runs into two problems:
Running python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4 fails with a keras or keras.api problem when the --model=resnet50 parameter is used: the model cannot be found.
The second problem is that during the inference test I get tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0.
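For completeness, a sketch of the reproduction steps described above; the --device and --group-add flags are the usual options for exposing AMD GPUs to a container and may need adjusting for your setup:

```bash
docker pull rocm/tensorflow:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    rocm/tensorflow:latest
cd /benchmarks/scripts/tf_cnn_benchmarks
python ./tf_cnn_benchmarks.py --forward_only=True --data_name=imagenet \
    --model=resnet50 --num_batches=50000 --batch_size=8 --num_gpus=4
```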
Standalone code to reproduce the issue
Relevant log output
No response