Event leaks #1250

ekuznetsov139 · 2019-08-24T10:01:28Z

I've noticed that ROCm uses an astounding number of event objects when used as backend for https://github.com/tensorflow/tensor2tensor. It easily overshoots the kernel default of 4096, triggering warnings in dmesg (see ROCm/ROCK-Kernel-Driver#80) and, according to my traces, the number of InterruptSignal (ROCR-Runtime/src/core/runtime/interrupt_signal.cpp) objects stabilizes somewhere in the neighborhood of 60,000. Which is very hard to explain as anything but a resource leak.

Most of those are created via hsa_signal_create or Kalmar::ctx->getSignal(), and owned by HSAOp objects created in lib/hsa/mcwamp_hsa.cpp, of which there are, likewise, tens of thousands at any given time. It seems that they don't get released promptly when they are done executing, maybe because they form daisy chains of references.

ekuznetsov139 · 2019-08-25T00:02:59Z

This looks almost like a design feature: HSAOp's live in a ring buffer HSAQueue::asyncOps

https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L1281-L1282

which is set to 16384 entries by default, and no attempt to do resource reclamation is made until the pointer gets all the way around the buffer.

It is a trivial change to turn that std::vector into std::list and to add garbage collection (any time a new op is pushed to the tail of the list, remove all finished ops from its head), which drops the maximum number of simultaneously coexisting ops in my runs from 60,000 to 4,000.

There does not seem to be much need for that buffer in the first place, since no one ever does anything with entries in it, apart from checking the last enqueued op and counting the number of pending ops.

ekuznetsov139 · 2019-08-25T08:30:51Z

https://github.com/RadeonOpenCompute/hcc/compare/clang_tot_upgrade...ekuznetsov139:event_fix2?expand=1

And, in fact, I see that there's a change in master (but not in the 2.7 tag I've been working with) that goes even further and simply empties the entire queue on every wait.

However, things are not as peachy as I'd like, because, with my change, I am now getting odd crashes:

### HCC STATUS_CHECK Error: Unknown Error Code (0x29) at file:mcwamp_hsa.cpp line:1189

#2  0x00007f292c1708f4 in Kalmar::RocrQueue::callbackQueue(hsa_status_t, hsa_queue_s*, void*) () at /home/zen/rocm/hcc/lib/hsa/mcwamp_hsa.cpp:1189
#3  0x00007f2965733e24 in AMD::callback_t<void (*)(hsa_status_t, hsa_queue_s*, void*)>::operator()(hsa_status_t, hsa_queue_s*, void*) ()
   from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f2965731adf in amd::AqlQueue::DynamicScratchHandler(long, void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f29657754bf in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f296570970e in os::ThreadTrampoline(void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1

They are random and fairly rare (took me several tries and 30+ min of training to capture one in gdb).

I don't see how it could be my doing - I suspect an unrelated bug in resource refcounting that got exposed because objects are getting released sooner than before.

Going to rebuild with 346267d#diff-447101ed1788631fe8f3e89117d084f2 and see if that takes care of it.

jeffdaily · 2019-08-26T20:28:21Z

You cannot simply stop allocating signals for all kernels. In the current ring buffer implementation, if the buffer loops back around and overwrites a kernel op before it has launched, the device-allocated kernel argument buffer will be free'd before the kernel uses it.

We are aware of the large number of signals being used and we have plans to fix it. In the meantime, perhaps try reducing the size of the ring buffer using HCC_ASYNCOPS_SIZE env var. The default is 16K. Or if you're comfortable rebuilding HCC yourself, try cherry-picking the queue flush on wait. a9eec7b

ekuznetsov139 closed this as completed Aug 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Event leaks #1250

Event leaks #1250

ekuznetsov139 commented Aug 24, 2019 •

edited

Loading

ekuznetsov139 commented Aug 25, 2019

ekuznetsov139 commented Aug 25, 2019 •

edited

Loading

jeffdaily commented Aug 26, 2019

Event leaks #1250

Event leaks #1250

Comments

ekuznetsov139 commented Aug 24, 2019 • edited Loading

ekuznetsov139 commented Aug 25, 2019

ekuznetsov139 commented Aug 25, 2019 • edited Loading

jeffdaily commented Aug 26, 2019

ekuznetsov139 commented Aug 24, 2019 •

edited

Loading

ekuznetsov139 commented Aug 25, 2019 •

edited

Loading