Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event leaks #1250

Closed
ekuznetsov139 opened this issue Aug 24, 2019 · 3 comments
Closed

Event leaks #1250

ekuznetsov139 opened this issue Aug 24, 2019 · 3 comments

Comments

@ekuznetsov139
Copy link

ekuznetsov139 commented Aug 24, 2019

I've noticed that ROCm uses an astounding number of event objects when used as backend for https://github.com/tensorflow/tensor2tensor. It easily overshoots the kernel default of 4096, triggering warnings in dmesg (see ROCm/ROCK-Kernel-Driver#80) and, according to my traces, the number of InterruptSignal (ROCR-Runtime/src/core/runtime/interrupt_signal.cpp) objects stabilizes somewhere in the neighborhood of 60,000. Which is very hard to explain as anything but a resource leak.

Most of those are created via hsa_signal_create or Kalmar::ctx->getSignal(), and owned by HSAOp objects created in lib/hsa/mcwamp_hsa.cpp, of which there are, likewise, tens of thousands at any given time. It seems that they don't get released promptly when they are done executing, maybe because they form daisy chains of references.

@ekuznetsov139
Copy link
Author

This looks almost like a design feature: HSAOp's live in a ring buffer HSAQueue::asyncOps

https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L1281-L1282

which is set to 16384 entries by default, and no attempt to do resource reclamation is made until the pointer gets all the way around the buffer.

It is a trivial change to turn that std::vector into std::list and to add garbage collection (any time a new op is pushed to the tail of the list, remove all finished ops from its head), which drops the maximum number of simultaneously coexisting ops in my runs from 60,000 to 4,000.

There does not seem to be much need for that buffer in the first place, since no one ever does anything with entries in it, apart from checking the last enqueued op and counting the number of pending ops.

@ekuznetsov139
Copy link
Author

ekuznetsov139 commented Aug 25, 2019

https://github.com/RadeonOpenCompute/hcc/compare/clang_tot_upgrade...ekuznetsov139:event_fix2?expand=1

And, in fact, I see that there's a change in master (but not in the 2.7 tag I've been working with) that goes even further and simply empties the entire queue on every wait.

However, things are not as peachy as I'd like, because, with my change, I am now getting odd crashes:

### HCC STATUS_CHECK Error: Unknown Error Code (0x29) at file:mcwamp_hsa.cpp line:1189

#2  0x00007f292c1708f4 in Kalmar::RocrQueue::callbackQueue(hsa_status_t, hsa_queue_s*, void*) () at /home/zen/rocm/hcc/lib/hsa/mcwamp_hsa.cpp:1189
#3  0x00007f2965733e24 in AMD::callback_t<void (*)(hsa_status_t, hsa_queue_s*, void*)>::operator()(hsa_status_t, hsa_queue_s*, void*) ()
   from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f2965731adf in amd::AqlQueue::DynamicScratchHandler(long, void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f29657754bf in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f296570970e in os::ThreadTrampoline(void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1

They are random and fairly rare (took me several tries and 30+ min of training to capture one in gdb).

I don't see how it could be my doing - I suspect an unrelated bug in resource refcounting that got exposed because objects are getting released sooner than before.

Going to rebuild with 346267d#diff-447101ed1788631fe8f3e89117d084f2 and see if that takes care of it.

@jeffdaily
Copy link
Collaborator

You cannot simply stop allocating signals for all kernels. In the current ring buffer implementation, if the buffer loops back around and overwrites a kernel op before it has launched, the device-allocated kernel argument buffer will be free'd before the kernel uses it.

We are aware of the large number of signals being used and we have plans to fix it. In the meantime, perhaps try reducing the size of the ring buffer using HCC_ASYNCOPS_SIZE env var. The default is 16K. Or if you're comfortable rebuilding HCC yourself, try cherry-picking the queue flush on wait. a9eec7b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants