Processes using a GPU #60
Currently the PID is only stored per VM in the kernel, not per GPU, so there isn't an existing interface to make that happen; it would need to be done in the kernel first before it can make its way up to ROCm and the SMI. I am checking to see if it's on the kernel roadmap. The kernel also doesn't store how much VRAM is used per process, so that would take more work to implement.
I'll note that what processes are actually running on the GPU at any particular point in time is completely divorced from both user space and the kernel. AMD GPUs using the ROCm software stack use a hardware scheduler (HWS) block to decide what runs during particular GPU timeslices. The kernel tells the HWS when there is a new work queue from a process, so the HWS knows which processes can request that work be run on the GPU. However, the HWS fully and independently chooses which of those queues to pull work from: it can run multiple kernels from any particular queue (if the software requests this), it can have multiple queues from any particular process assigned to the GPU, and it can run kernels from multiple processes at any one time. The HWS also initiates task switching of kernels (CWSR) so that one kernel cannot take over the GPU permanently; it decides when to stop a kernel from running, and it makes the decisions about when to switch out queues or processes to allow other things to run.

In addition, putting work into a work queue can be done without going through the kernel at all (known as user-mode queuing). So while the kernel must know when a queue is first created, it does not necessarily know when any work is actually being enqueued to the GPU (and, as described above, it does not know when the GPU actually starts running that work). As such, while we may be able to tell you which processes can assign work to each GPU, it would be much more difficult to tell you exactly what is running on each GPU at any particular point in time.
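To make the user-mode queuing point concrete, here is a minimal sketch against the public HSA runtime API that ROCm ships (hsa.h / libhsa-runtime64). It is not the stack's internal mechanism, just an illustration: it assumes at least one GPU agent is present, skips most error handling, and enqueues a harmless barrier packet instead of a real kernel dispatch.

```c
/* Minimal user-mode queuing sketch using the HSA runtime.
 * Build (roughly): cc demo.c -lhsa-runtime64 */
#include <hsa/hsa.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pick the first GPU agent we find. */
static hsa_status_t find_gpu(hsa_agent_t agent, void *data) {
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type == HSA_DEVICE_TYPE_GPU) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK;
    }
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();

    hsa_agent_t gpu = {0};
    hsa_iterate_agents(find_gpu, &gpu);

    /* Queue creation is the one step the kernel driver (KFD) does see. */
    hsa_queue_t *queue = NULL;
    if (hsa_queue_create(gpu, 1024, HSA_QUEUE_TYPE_SINGLE, NULL, NULL,
                         UINT32_MAX, UINT32_MAX, &queue) != HSA_STATUS_SUCCESS) {
        fprintf(stderr, "queue creation failed (no GPU agent?)\n");
        return 1;
    }

    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);

    /* Everything below is pure user space: reserve a packet slot, write an
     * AQL barrier packet, publish its header, and ring the doorbell signal. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_barrier_and_packet_t *pkt =
        (hsa_barrier_and_packet_t *)queue->base_address +
        (index & (queue->size - 1));

    memset(pkt, 0, sizeof(*pkt));
    pkt->completion_signal = done;
    __atomic_store_n(&pkt->header,
                     (uint16_t)(HSA_PACKET_TYPE_BARRIER_AND
                                << HSA_PACKET_HEADER_TYPE),
                     __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal, index);

    /* The packet processor consumes the packet without any per-packet syscall. */
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                              UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    printf("barrier packet %llu retired via a user-mode queue\n",
           (unsigned long long)index);

    hsa_signal_destroy(done);
    hsa_queue_destroy(queue);
    hsa_shut_down();
    return 0;
}
```

The only step the kernel driver is involved in is hsa_queue_create(); writing the AQL packet and ringing the doorbell are plain memory operations from user space, which is why the kernel never observes individual dispatches.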
So that you don't feel completely dismayed about this request: we are trying to scope the work required to make this information available. It's not guaranteed to be doable, but we're currently seeing whether it's possible and, if it is, how to make it happen. And thanks @jlgreathouse for that explanation and clarification!
Thanks for the responses! I was able to solve the specific issue I needed this information to debug, but if you are able to provide this functionality it will come in handy in the future.
It looks like we may get this into 2.6. I'm currently working on it. Worth keeping your eyes open for it.
@kentrussell Was it included in 2.6? I took a look at the docs but did not find it mentioned. Anyway, looking forward to it whenever it hits. |
Jeez, I didn't update the master branch when 2.6 was released... It's in the roc-2.6.x branch. I'll update master now to reflect that, but you can check out roc-2.6.x to try it out. I should clarify: we don't have per-PID GPU usage or memory usage yet. Right now we've only got the PIDs associated with a compute queue. The other parts are on the roadmap; I'm not sure which release yet, but hopefully within the next 1-3 releases.
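For anyone who wants to poke at this programmatically, here is a hedged sketch against the rocm_smi_lib C API; rsmi_compute_process_info_get() is what more recent library versions expose for listing KFD compute processes, and its exact shape in the roc-2.6.x branch may differ.

```c
/* Sketch: list PIDs that have compute queues registered with KFD,
 * using the rocm_smi_lib C API. */
#include <rocm_smi/rocm_smi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) {
        fprintf(stderr, "rsmi_init failed\n");
        return 1;
    }

    /* First call with NULL to learn how many compute processes exist. */
    uint32_t num_procs = 0;
    rsmi_compute_process_info_get(NULL, &num_procs);

    rsmi_process_info_t *procs = calloc(num_procs, sizeof(*procs));
    rsmi_compute_process_info_get(procs, &num_procs);

    for (uint32_t i = 0; i < num_procs; ++i) {
        /* Only PIDs with an open compute queue are reported here. */
        printf("PID %u (PASID %u) has compute queues registered with KFD\n",
               procs[i].process_id, procs[i].pasid);
    }

    free(procs);
    rsmi_shut_down();
    return 0;
}
```

On the command-line side, recent rocm-smi releases surface the same list via the --showpids option.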
This was added in 3.3 at long last (with support for identifying the node attached to a specific process). |
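And a companion sketch for the node-per-process association mentioned above, again assuming the rocm_smi_lib C API; the PID below is a placeholder, and the NULL-first sizing call mirrors the pattern used elsewhere in the library rather than anything confirmed in this thread.

```c
/* Sketch: map a compute PID to the GPU device indices it has queues on. */
#include <rocm_smi/rocm_smi.h>
#include <stdio.h>

int main(void) {
    rsmi_init(0);

    uint32_t pid = 12345;          /* hypothetical compute process ID */
    uint32_t num_devices = 0;

    /* First call with NULL to size the device-index buffer. */
    if (rsmi_compute_process_gpus_get(pid, NULL, &num_devices)
            != RSMI_STATUS_SUCCESS) {
        fprintf(stderr, "PID %u is not a known compute process\n", pid);
        rsmi_shut_down();
        return 1;
    }

    uint32_t dv_indices[64];       /* generous upper bound on node count */
    if (num_devices > 64) num_devices = 64;
    rsmi_compute_process_gpus_get(pid, dv_indices, &num_devices);

    for (uint32_t i = 0; i < num_devices; ++i)
        printf("PID %u is using GPU device index %u\n", pid, dv_indices[i]);

    rsmi_shut_down();
    return 0;
}
```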
In nvidia-smi you can see which processes are running on a given GPU, including how much memory each process has allocated. I do not see this being available in rocm-smi. I see #42 asks specifically for memory usage, but I think more interesting for debugging purposes would be a list of the processes using each GPU, along with the memory used by each process. Is this on the roadmap?