
Driver running out of events #80

Closed · ekuznetsov139 opened this issue Aug 23, 2019 · 7 comments

@ekuznetsov139

There have been a few reports (ROCm/ROCm#748, ROCm/ROCm#858) of ROCm-backed apps crashing or otherwise failing in diverse ways while "Signal event wasn't created because limit was reached" appears in the kernel log.

The problem occurs because the pool of signal events is exhausted. The limit on events is defined in include/uapi/linux/kfd_ioctl.h#L263.

It is exhausted because events are created (via the API call hsaKmtCreateEvent()) for operations as varied as async memcpys and inter-stream dependencies. A sufficiently large TensorFlow graph (e.g. anything transformer-based) can easily exceed 4096.
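To make the failure mode concrete, here is a minimal sketch (assuming the standard libhsakmt signatures; error handling trimmed, and node 0 is an assumption) that creates signal events in a loop until the driver refuses:

```c
/* Sketch: exhaust the KFD signal-event pool on purpose. */
#include <stdbool.h>
#include <stdio.h>
#include <hsakmt.h>

int main(void)
{
    if (hsaKmtOpenKFD() != HSAKMT_STATUS_SUCCESS)
        return 1;

    HsaEventDescriptor desc = {0};
    desc.EventType = HSA_EVENTTYPE_SIGNAL;
    desc.NodeId = 0;                       /* first GPU node (assumption) */

    unsigned count = 0;
    for (;;) {
        HsaEvent *ev = NULL;
        /* ManualReset = false, IsSignaled = false */
        if (hsaKmtCreateEvent(&desc, false, false, &ev) != HSAKMT_STATUS_SUCCESS)
            break;                         /* pool exhausted */
        count++;                           /* deliberately never destroyed */
    }
    printf("created %u signal events before failure\n", count);

    hsaKmtCloseKFD();
    return 0;
}
```

With the stock limit this should report a number close to 4096, with the message quoted above showing up in the kernel log.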

Unless there's a very good reason why it can't be done, I believe the limit should be increased. Can we do 32k?

@ekuznetsov139 (Author)

To be fair, it looks like ROCm is also leaking events like a sieve. I managed to rebuild the upstream driver and libhsakmt.so with the cap raised to 32768, and TensorFlow promptly hit that cap too. Events are not being released after use at all. I'll check with the 2.7 release to see if it happens there too.
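For reference, the rebuild described above amounts to bumping one macro in the UAPI header and recompiling the driver and thunk (KFD_SIGNAL_EVENT_LIMIT is my assumption for the name at the line linked in the opening comment):

```c
/* include/uapi/linux/kfd_ioctl.h -- sketch of the experiment above;
 * macro name assumed, stock value 4096. */
#define KFD_SIGNAL_EVENT_LIMIT 32768    /* was 4096 */
```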

@fxkamd (Contributor) commented Aug 23, 2019

Exhausting the event limit should not lead to failures. When we're out of events, it just means we can't wait for events by sleeping any more; we have to fall back to polling. The "event limit reached" message is probably completely unrelated to the crashes you're seeing.
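Not the actual ROCr code, but a sketch of the shape such a fallback takes: sleep in the kernel while waiting when an interrupt-backed event is available, otherwise spin on the completion signal's value in user space.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only. With a kernel event, the waiter blocks in the driver
 * (e.g. via hsaKmtWaitOnEvent) and is woken by the interrupt handler; with
 * the event pool exhausted, it has to poll the signal's memory location. */
static void wait_for_completion(_Atomic int64_t *signal, int64_t done_value,
                                bool have_kernel_event)
{
    if (have_kernel_event) {
        /* interrupt-driven path, omitted here */
        return;
    }
    /* Polling fallback: keeps one CPU core busy, as noted later
     * in this thread. */
    while (atomic_load(signal) != done_value)
        sched_yield();
}
```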

@ekuznetsov139 (Author)

Hmm, okay. The problem I am seeing is either a synchronization failure or a random silent kernel launch failure, but maybe it is unrelated to the event count. Where exactly is the logic that falls back to waiting for events by polling?

@fxkamd (Contributor) commented Aug 23, 2019

The polling fallback is implemented in the ROCr runtime. Sorry I can't be more specific; I don't know my way around that code well enough.

@ekuznetsov139 (Author)

I've confirmed the existence of the polling fallback.

With vanilla 2.7, I am no longer able to reproduce my crash; its origin is probably going to remain a mystery. However, 2.7 is still leaking events. I've closed this issue, and I'll raise another concerning the event leak in the appropriate repo after I figure out who exactly is responsible for it.

@949f45ac commented Aug 29, 2019

> When we're out of events it just means we can't wait for events by sleeping any more. We have to fall back to polling.

Right, so from that point on, one full CPU core is blocked. I don't believe that behaviour is fine at all!
As laid out in my issue (ROCm/ROCm#748), ROCm 2.0 somehow managed to never fall back to polling.
Maybe a large number of GPU compute use cases only run for some finite time until a result is found, but there are programs like miners which are meant to run indefinitely. If a workaround exists in restarting the miner every two minutes (as opposed to rebooting the system), then why can't the ROCm runtime release these events by itself while it's being used?
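For what it's worth, the thunk API does expose a matching release call (hsaKmtDestroyEvent, assuming the standard libhsakmt interface), so a runtime that tracked its events could reclaim them as they complete. A trivial sketch:

```c
/* Sketch: the counterpart to hsaKmtCreateEvent. A runtime that kept a
 * registry of its events could call this as soon as each one has served
 * its purpose, freeing the kernel event slot. */
#include <hsakmt.h>

static void release_event(HsaEvent *ev)
{
    if (ev)
        hsaKmtDestroyEvent(ev);
}
```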

@ekuznetsov139 (Author)

I also see the interrupt ring in kfd_interrupt.c (8192 entries) being persistently overflowed, producing dmesg errors "Interrupt ring overflow, dropping interrupt 0". Is that one harmless as well?
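For context, that message is the usual fixed-capacity ring pattern: when the write pointer would catch up to the read pointer, new entries are dropped. A generic sketch (illustrative, not the actual kfd_interrupt.c code):

```c
#include <stdint.h>
#include <stdio.h>

/* Generic fixed-size interrupt ring that drops entries on overflow,
 * the behaviour the dmesg message suggests. Names and the single-word
 * entry format are illustrative. */
#define IH_RING_ENTRIES 8192

struct ih_ring {
    uint32_t entries[IH_RING_ENTRIES];
    unsigned rptr, wptr;               /* consumer / producer indices */
};

static int ih_ring_push(struct ih_ring *ring, uint32_t entry)
{
    unsigned next = (ring->wptr + 1) % IH_RING_ENTRIES;
    if (next == ring->rptr) {          /* ring full: drop, as in the log */
        fprintf(stderr, "Interrupt ring overflow, dropping interrupt\n");
        return -1;
    }
    ring->entries[ring->wptr] = entry;
    ring->wptr = next;
    return 0;
}
```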

@949f45ac They do have a fix for excessive event use; it's very recent, so it's not in 2.7 (maybe it'll be in 2.8).
