
AMD GPU Support Plan #4127

Open
3 of 4 tasks
abuccts opened this issue Jan 9, 2020 · 3 comments


abuccts commented Jan 9, 2020

Support AMD GPU in PAI:

Currently we support AMD GPUs in hivedscheduler only.
- [ ] Specify AMD GPU in the protocol when using the default scheduler
- [ ] Support AMD metrics (job exporter, VC in REST server) when using the default scheduler

@fanyangCS fanyangCS self-assigned this Jan 15, 2020
@fanyangCS fanyangCS assigned Binyang2014 and abuccts and unassigned fanyangCS Feb 16, 2020
@Binyang2014

To show GPU metrics in Grafana, we need to add these metrics for AMD (a rough exporter sketch follows the list):

nvidiasmi_utilization_gpu
nvidiasmi_utilization_memory
task_gpu_percent
task_gpu_mem_percent
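
A minimal sketch of how a job exporter could populate AMD equivalents of the first two metrics by parsing rocm-smi output. The rocm-smi flags, the JSON key names, and the `rocmsmi_*` metric names here are assumptions to verify, not the actual exporter code; `task_gpu_percent` / `task_gpu_mem_percent` additionally need the container-to-GPU mapping discussed later, so they are not covered here.

```python
# Sketch only: expose per-card rocm-smi readings as Prometheus gauges in the
# same shape the Grafana dashboards expect for nvidiasmi_* metrics.
# rocm-smi flags and JSON keys below are assumptions and may differ by ROCm version.
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

rocm_util_gpu = Gauge("rocmsmi_utilization_gpu",
                      "GPU utilization reported by rocm-smi", ["minor_number"])
rocm_util_mem = Gauge("rocmsmi_utilization_memory",
                      "GPU memory utilization reported by rocm-smi", ["minor_number"])

def collect_once():
    # Assumed output: {"card0": {"GPU use (%)": "12", "GPU memory use (%)": "3"}, ...}
    out = subprocess.check_output(
        ["rocm-smi", "--showuse", "--showmemuse", "--json"], text=True)
    for card, fields in json.loads(out).items():
        if not isinstance(fields, dict):
            continue
        minor = card.replace("card", "")  # "card0" -> "0" (assumed naming)
        if "GPU use (%)" in fields:
            rocm_util_gpu.labels(minor_number=minor).set(float(fields["GPU use (%)"]))
        if "GPU memory use (%)" in fields:
            rocm_util_mem.labels(minor_number=minor).set(float(fields["GPU memory use (%)"]))

if __name__ == "__main__":
    start_http_server(9102)  # port is arbitrary for the sketch
    while True:
        collect_once()
        time.sleep(30)
```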

For alerts, we need to add these metrics (a latency-histogram sketch follows the list):

cmd_nvidia_smi_latency_seconds_bucket
nvidiasmi_ecc_error_count
nvidiasmi_memory_leak_count
zombie_process_count
gpu_used_by_external_process_count
gpu_used_by_zombie_container_count
nvidiasmi_utilization_gpu:count
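
For the first alert metric, the AMD analogue would just be a Prometheus histogram wrapped around the rocm-smi call; a minimal sketch under that assumption (the `cmd_rocm_smi_latency_seconds` name is chosen by analogy with the NVIDIA metric):

```python
# Sketch: time each rocm-smi invocation so a cmd_rocm_smi_latency_seconds_bucket
# series exists for the latency alert rule.
import subprocess
from prometheus_client import Histogram

rocm_smi_latency = Histogram(
    "cmd_rocm_smi_latency_seconds",
    "Latency of invoking rocm-smi",
)

def run_rocm_smi(args):
    # Histogram.time() records how long the block takes and fills the
    # _bucket/_sum/_count series that Prometheus alert rules can query.
    with rocm_smi_latency.time():
        return subprocess.check_output(["rocm-smi", *args], text=True)
```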

Will treat adding the Grafana-related metrics as high priority and start working on them.
The alert-related metrics may be implemented later.


Binyang2014 commented Feb 27, 2020

For the Grafana panel: rocm-smi provides all the metrics we need.
For alerts:
The following alerts are supported:

cmd_rocm_smi_latency_seconds_bucket
rocm_memory_leak_count
rocmsmi_utilization_gpu:count

The following alerts are not supported, due to issue ROCm/ROC-smi#60:

To collect the following metrics, we need to know which processes are running on each GPU. However, rocm-smi only provides the process IDs that are using GPUs; it does not specify which process uses which GPU.

For the following metrics, we can therefore only detect that a GPU is used by a zombie or external process, without detailed information such as "The GPU with minor number 0 is used by an external process" (see the sketch after this list):

zombie_process_count
gpu_used_by_external_process_count
gpu_used_by_zombie_container_count
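
A rough sketch of the coarse check that remains possible: take the PIDs rocm-smi reports, subtract the PIDs that belong to PAI containers, and only bump an aggregate counter, since per-GPU attribution is unavailable. The `--showpids` flag and its output format are assumptions to verify against the installed rocm-smi, and `pai_container_pids` is a hypothetical helper.

```python
# Sketch only: detect "some GPU is used by an external process" without
# knowing which GPU, because rocm-smi does not map PIDs to GPU indices.
import re
import subprocess
from prometheus_client import Gauge

external_process_count = Gauge(
    "gpu_used_by_external_process_count",
    "Number of GPU-using processes that do not belong to any PAI container",
)

def pai_container_pids():
    # Hypothetical helper: collect PIDs belonging to PAI-launched containers
    # (e.g. via `docker top` on each container). Not shown here.
    return set()

def collect_external_processes():
    # Assumed flag: `rocm-smi --showpids` lists PIDs currently using any GPU.
    out = subprocess.check_output(["rocm-smi", "--showpids"], text=True)
    gpu_pids = {int(p) for p in re.findall(r"\b\d{2,}\b", out)}  # crude parse
    external = gpu_pids - pai_container_pids()
    # No per-GPU label is possible, so the count is node-wide.
    external_process_count.set(len(external))
```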

The following alert is not supported, due to the lack of retired-pages details.
rocm-smi provides a --showretiredpages flag, but its output lacks details:

nvidiasmi_ecc_error_count

The feature requirements for AMD:

  1. Provide the process IDs running on each GPU.
  2. Provide clear documentation for rocm-smi flags, including sample output.

@Binyang2014

Currently, we don't support AMD metrics when using the default scheduler.
The reason is that there is no environment variable like NVIDIA_VISIBLE_DEVICES to tell which GPU indices are mounted in the container. Refer to ROCm/ROCm#994.

Maybe we can use docker inspect and rocm-smi to get the GPU index mapping (a rough sketch follows), but for now this will not be supported for the default scheduler.
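
For the record, a rough sketch of that docker inspect idea: read the container's mounted `/dev/dri/renderD*` device nodes, then map each render node to the GPU's PCI bus ID through sysfs. The render-node naming, the sysfs layout, and the final match against `rocm-smi --showbus` are all assumptions; this is exactly the part that is not supported today.

```python
# Sketch of the workaround only: infer which AMD GPUs a container sees from
# `docker inspect`, then map each render node to a PCI bus ID via sysfs.
import json
import os
import subprocess

def container_render_nodes(container_id):
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{json .HostConfig.Devices}}", container_id],
        text=True,
    )
    devices = json.loads(out) or []
    return [d["PathOnHost"] for d in devices if "renderD" in d.get("PathOnHost", "")]

def render_node_to_pci_bus(render_node):
    # /sys/class/drm/renderD<N>/device is a symlink into the PCI tree,
    # e.g. .../0000:03:00.0 -- the trailing component is the GPU's bus ID.
    name = os.path.basename(render_node)           # e.g. "renderD128"
    link = os.readlink(f"/sys/class/drm/{name}/device")
    return os.path.basename(link)                  # e.g. "0000:03:00.0"

def container_gpu_buses(container_id):
    # These bus IDs could then be matched against `rocm-smi --showbus`
    # (assumed flag) to recover the rocm-smi card indices for the container.
    return [render_node_to_pci_bus(n) for n in container_render_nodes(container_id)]
```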
