
AMD GPU Support Plan #4127

Open
3 of 4 tasks
abuccts opened this issue Jan 9, 2020 · 3 comments


abuccts commented Jan 9, 2020

Support AMD GPU in PAI:

Currently we support AMD GPUs in hivedscheduler only.
- [ ] Specify AMD GPU in the protocol when using the default scheduler
- [ ] Support AMD metrics (job exporter, VC in REST server) when using the default scheduler

@fanyangCS fanyangCS self-assigned this Jan 15, 2020
@fanyangCS fanyangCS assigned Binyang2014 and abuccts and unassigned fanyangCS Feb 16, 2020
@Binyang2014

To show GPU metrics in Grafana, we need to add these metrics for AMD (a rough exporter sketch follows the list):

nvidiasmi_utilization_gpu
nvidiasmi_utilization_memory
task_gpu_percent
task_gpu_mem_percent
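
A minimal sketch of how a job exporter could populate AMD equivalents of the first two metrics by parsing rocm-smi output. The rocm-smi flags, the JSON key names, and the `rocmsmi_*` metric names here are assumptions to verify, not the actual exporter code; `task_gpu_percent` / `task_gpu_mem_percent` additionally need the container-to-GPU mapping discussed later, so they are not covered here.

```python
# Sketch only: expose per-card rocm-smi readings as Prometheus gauges in the
# same shape the Grafana dashboards expect for nvidiasmi_* metrics.
# rocm-smi flags and JSON keys below are assumptions and may differ by ROCm version.
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

rocm_util_gpu = Gauge("rocmsmi_utilization_gpu",
                      "GPU utilization reported by rocm-smi", ["minor_number"])
rocm_util_mem = Gauge("rocmsmi_utilization_memory",
                      "GPU memory utilization reported by rocm-smi", ["minor_number"])

def collect_once():
    # Assumed output: {"card0": {"GPU use (%)": "12", "GPU memory use (%)": "3"}, ...}
    out = subprocess.check_output(
        ["rocm-smi", "--showuse", "--showmemuse", "--json"], text=True)
    for card, fields in json.loads(out).items():
        if not isinstance(fields, dict):
            continue
        minor = card.replace("card", "")  # "card0" -> "0" (assumed naming)
        if "GPU use (%)" in fields:
            rocm_util_gpu.labels(minor_number=minor).set(float(fields["GPU use (%)"]))
        if "GPU memory use (%)" in fields:
            rocm_util_mem.labels(minor_number=minor).set(float(fields["GPU memory use (%)"]))

if __name__ == "__main__":
    start_http_server(9102)  # port is arbitrary for the sketch
    while True:
        collect_once()
        time.sleep(30)
```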

For alerts, we need to add these metrics (a latency-histogram sketch follows the list):

cmd_nvidia_smi_latency_seconds_bucket
nvidiasmi_ecc_error_count
nvidiasmi_memory_leak_count
zombie_process_count
gpu_used_by_external_process_count
gpu_used_by_zombie_container_count
nvidiasmi_utilization_gpu:count
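
For the first alert metric, the AMD analogue would just be a Prometheus histogram wrapped around the rocm-smi call; a minimal sketch under that assumption (the `cmd_rocm_smi_latency_seconds` name is chosen by analogy with the NVIDIA metric):

```python
# Sketch: time each rocm-smi invocation so a cmd_rocm_smi_latency_seconds_bucket
# series exists for the latency alert rule.
import subprocess
from prometheus_client import Histogram

rocm_smi_latency = Histogram(
    "cmd_rocm_smi_latency_seconds",
    "Latency of invoking rocm-smi",
)

def run_rocm_smi(args):
    # Histogram.time() records how long the block takes and fills the
    # _bucket/_sum/_count series that Prometheus alert rules can query.
    with rocm_smi_latency.time():
        return subprocess.check_output(["rocm-smi", *args], text=True)
```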

Will treat adding the Grafana-related metrics as high priority and start working on them.
The alert-related metrics may be implemented later.


Binyang2014 commented Feb 27, 2020

For the Grafana panel: rocm-smi provides all the metrics we need.
For alerts:
The following alerts are supported:

cmd_rocm_smi_latency_seconds_bucket
rocm_memory_leak_count
rocmsmi_utilization_gpu:count

The following alerts are not supported, due to issue ROCm/ROC-smi#60:

To collect the following metrics, we need to know which processes are running on each GPU. However, rocm-smi only provides the process IDs that are using GPUs; it does not specify which process uses which GPU.

For the following metrics, we can therefore only detect that a GPU is used by a zombie or external process, without detailed information such as "The GPU with minor number 0 is used by an external process" (see the sketch after this list):

zombie_process_count
gpu_used_by_external_process_count
gpu_used_by_zombie_container_count
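
A rough sketch of the coarse check that remains possible: take the PIDs rocm-smi reports, subtract the PIDs that belong to PAI containers, and only bump an aggregate counter, since per-GPU attribution is unavailable. The `--showpids` flag and its output format are assumptions to verify against the installed rocm-smi, and `pai_container_pids` is a hypothetical helper.

```python
# Sketch only: detect "some GPU is used by an external process" without
# knowing which GPU, because rocm-smi does not map PIDs to GPU indices.
import re
import subprocess
from prometheus_client import Gauge

external_process_count = Gauge(
    "gpu_used_by_external_process_count",
    "Number of GPU-using processes that do not belong to any PAI container",
)

def pai_container_pids():
    # Hypothetical helper: collect PIDs belonging to PAI-launched containers
    # (e.g. via `docker top` on each container). Not shown here.
    return set()

def collect_external_processes():
    # Assumed flag: `rocm-smi --showpids` lists PIDs currently using any GPU.
    out = subprocess.check_output(["rocm-smi", "--showpids"], text=True)
    gpu_pids = {int(p) for p in re.findall(r"\b\d{2,}\b", out)}  # crude parse
    external = gpu_pids - pai_container_pids()
    # No per-GPU label is possible, so the count is node-wide.
    external_process_count.set(len(external))
```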

The following alert is not supported, due to the lack of retired-pages details.
rocm-smi provides a --showretiredpages flag, but its output lacks details:

nvidiasmi_ecc_error_count

The feature requirements for AMD:

  1. Provide the process IDs running on each GPU.
  2. Provide clear documentation for rocm-smi flags, including sample output.

@Binyang2014

Currently, we don't support AMD metrics when using the default scheduler.
The reason is that there is no environment variable like NVIDIA_VISIBLE_DEVICES to tell which GPU indices are mounted in the container. Refer to ROCm/ROCm#994.

Maybe we can use docker inspect and rocm-smi to get the GPU index mapping (a rough sketch follows), but for now this will not be supported for the default scheduler.
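
For the record, a rough sketch of that docker inspect idea: read the container's mounted `/dev/dri/renderD*` device nodes, then map each render node to the GPU's PCI bus ID through sysfs. The render-node naming, the sysfs layout, and the final match against `rocm-smi --showbus` are all assumptions; this is exactly the part that is not supported today.

```python
# Sketch of the workaround only: infer which AMD GPUs a container sees from
# `docker inspect`, then map each render node to a PCI bus ID via sysfs.
import json
import os
import subprocess

def container_render_nodes(container_id):
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{json .HostConfig.Devices}}", container_id],
        text=True,
    )
    devices = json.loads(out) or []
    return [d["PathOnHost"] for d in devices if "renderD" in d.get("PathOnHost", "")]

def render_node_to_pci_bus(render_node):
    # /sys/class/drm/renderD<N>/device is a symlink into the PCI tree,
    # e.g. .../0000:03:00.0 -- the trailing component is the GPU's bus ID.
    name = os.path.basename(render_node)           # e.g. "renderD128"
    link = os.readlink(f"/sys/class/drm/{name}/device")
    return os.path.basename(link)                  # e.g. "0000:03:00.0"

def container_gpu_buses(container_id):
    # These bus IDs could then be matched against `rocm-smi --showbus`
    # (assumed flag) to recover the rocm-smi card indices for the container.
    return [render_node_to_pci_bus(n) for n in container_render_nodes(container_id)]
```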
