Processes using a GPU #60
Currently the PID is only stored per VM in the kernel, not per GPU, so there isn't an existing interface to make that happen; it would need to be done in the kernel first before it can make its way up to ROCm and the SMI. I am checking to see if it's on the kernel roadmap. The kernel also doesn't store how much VRAM is used per process, so that would take more work to implement.
I'll note that what processes are actually running on the GPU at any particular point in time is completely divorced from both user space and the kernel. AMD GPUs using the ROCm software stack use a hardware scheduler (HWS) block to decide what runs during particular GPU timeslices. The kernel tells the HWS when there is a new work queue from a process, so the HWS knows which processes can request that work be run on the GPU. However, the HWS fully and independently chooses which of those queues to pull work from: it can run multiple kernels from any particular queue (if the software requests this), it can have multiple queues from any particular process assigned to the GPU, and it can run kernels from multiple processes at any one time. The HWS also initiates task switching of kernels (CWSR) so that one kernel cannot take over the GPU permanently; it decides when to stop a kernel from running, and it makes the decisions about when to switch out queues or processes to allow other things to run.

In addition, putting work into a work queue can be done without going through the kernel at all (known as user-mode queuing). So while the kernel must know when a queue is first created, it does not necessarily know when any work is actually being enqueued to the GPU (and, as described above, it does not know when the GPU actually starts running that work). As such, while we may be able to tell you which processes can assign work to each GPU, it would be much more difficult to tell you exactly what is running on each GPU at any particular point in time.
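To make the user-mode queuing point concrete, here is a minimal sketch against the public HSA runtime API that ROCm ships (hsa.h / libhsa-runtime64). It is not the stack's internal mechanism, just an illustration: it assumes at least one GPU agent is present, skips most error handling, and enqueues a harmless barrier packet instead of a real kernel dispatch.

```c
/* Minimal user-mode queuing sketch using the HSA runtime.
 * Build (roughly): cc demo.c -lhsa-runtime64 */
#include <hsa/hsa.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pick the first GPU agent we find. */
static hsa_status_t find_gpu(hsa_agent_t agent, void *data) {
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type == HSA_DEVICE_TYPE_GPU) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK;
    }
    return HSA_STATUS_SUCCESS;
}

int main(void) {
    hsa_init();

    hsa_agent_t gpu = {0};
    hsa_iterate_agents(find_gpu, &gpu);

    /* Queue creation is the one step the kernel driver (KFD) does see. */
    hsa_queue_t *queue = NULL;
    if (hsa_queue_create(gpu, 1024, HSA_QUEUE_TYPE_SINGLE, NULL, NULL,
                         UINT32_MAX, UINT32_MAX, &queue) != HSA_STATUS_SUCCESS) {
        fprintf(stderr, "queue creation failed (no GPU agent?)\n");
        return 1;
    }

    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);

    /* Everything below is pure user space: reserve a packet slot, write an
     * AQL barrier packet, publish its header, and ring the doorbell signal. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_barrier_and_packet_t *pkt =
        (hsa_barrier_and_packet_t *)queue->base_address +
        (index & (queue->size - 1));

    memset(pkt, 0, sizeof(*pkt));
    pkt->completion_signal = done;
    __atomic_store_n(&pkt->header,
                     (uint16_t)(HSA_PACKET_TYPE_BARRIER_AND
                                << HSA_PACKET_HEADER_TYPE),
                     __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal, index);

    /* The packet processor consumes the packet without any per-packet syscall. */
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                              UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    printf("barrier packet %llu retired via a user-mode queue\n",
           (unsigned long long)index);

    hsa_signal_destroy(done);
    hsa_queue_destroy(queue);
    hsa_shut_down();
    return 0;
}
```

The only step the kernel driver is involved in is hsa_queue_create(); writing the AQL packet and ringing the doorbell are plain memory operations from user space, which is why the kernel never observes individual dispatches.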
So that you don't feel completely dismayed about this request: we are trying to scope the work required to make this information available. It's not guaranteed to be doable, but we're currently seeing whether it's possible and, if it is, how to make it happen. And thanks @jlgreathouse for that explanation and clarification!
Thanks for the responses! I was able to solve the specific issue I needed this information to debug, but if you are able to provide this functionality it will come in handy in the future.
It looks like we may get this into 2.6. I'm currently working on it. Worth keeping your eyes open for it.
@kentrussell Was it included in 2.6? I took a look at the docs but did not find it mentioned. Anyway, looking forward to it whenever it hits. |
Jeez, I didn't update the master branch when 2.6 was released... It's in the roc-2.6.x branch. I'll update master now to reflect that, but you can check out roc-2.6.x to try it out. I should clarify: we don't have per-PID GPU usage or memory usage yet. Right now we've only got the PIDs associated with a compute queue. The other parts are on the roadmap; I'm not sure which release yet, but hopefully within the next 1-3 releases.
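For anyone who wants to poke at this programmatically, here is a hedged sketch against the rocm_smi_lib C API; rsmi_compute_process_info_get() is what more recent library versions expose for listing KFD compute processes, and its exact shape in the roc-2.6.x branch may differ.

```c
/* Sketch: list PIDs that have compute queues registered with KFD,
 * using the rocm_smi_lib C API. */
#include <rocm_smi/rocm_smi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) {
        fprintf(stderr, "rsmi_init failed\n");
        return 1;
    }

    /* First call with NULL to learn how many compute processes exist. */
    uint32_t num_procs = 0;
    rsmi_compute_process_info_get(NULL, &num_procs);

    rsmi_process_info_t *procs = calloc(num_procs, sizeof(*procs));
    rsmi_compute_process_info_get(procs, &num_procs);

    for (uint32_t i = 0; i < num_procs; ++i) {
        /* Only PIDs with an open compute queue are reported here. */
        printf("PID %u (PASID %u) has compute queues registered with KFD\n",
               procs[i].process_id, procs[i].pasid);
    }

    free(procs);
    rsmi_shut_down();
    return 0;
}
```

On the command-line side, recent rocm-smi releases surface the same list via the --showpids option.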
This was added in 3.3 at long last (with support for identifying the node attached to a specific process). |
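And a companion sketch for the node-per-process association mentioned above, again assuming the rocm_smi_lib C API; the PID below is a placeholder, and the NULL-first sizing call mirrors the pattern used elsewhere in the library rather than anything confirmed in this thread.

```c
/* Sketch: map a compute PID to the GPU device indices it has queues on. */
#include <rocm_smi/rocm_smi.h>
#include <stdio.h>

int main(void) {
    rsmi_init(0);

    uint32_t pid = 12345;          /* hypothetical compute process ID */
    uint32_t num_devices = 0;

    /* First call with NULL to size the device-index buffer. */
    if (rsmi_compute_process_gpus_get(pid, NULL, &num_devices)
            != RSMI_STATUS_SUCCESS) {
        fprintf(stderr, "PID %u is not a known compute process\n", pid);
        rsmi_shut_down();
        return 1;
    }

    uint32_t dv_indices[64];       /* generous upper bound on node count */
    if (num_devices > 64) num_devices = 64;
    rsmi_compute_process_gpus_get(pid, dv_indices, &num_devices);

    for (uint32_t i = 0; i < num_devices; ++i)
        printf("PID %u is using GPU device index %u\n", pid, dv_indices[i]);

    rsmi_shut_down();
    return 0;
}
```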
In nvidia-smi you can see which processes are running on a given GPU, including how much memory each process has allocated. I do not see this being available in rocm-smi. I see #42 asks specifically for memory usage, but I think more interesting for debugging purposes would be a list of the processes using each GPU, along with the memory used by each process. Is this on the roadmap?