Rocm/Roctracer will hang and crash when interrupted by a real-time timer #22
Comments
Replacing the real-time timer with a CPU-time timer still results in a hang. This suggests that handling of EINTR for system calls is not the issue. |
Could you upload the test that uses the CPU-time timer? |
You can replace the real-time timer with a CPU-time timer by changing line 65 of "MatrixTranpose.cpp". The comment block just before line 65 explains how to do it:
|
@eshcherb We found that the hang has nothing to do with roctracer. In the example attached below, the code does not link against roctracer and still hangs in the same way. This means that the bug is in the ROCm runtime. Where should I report this bug? Or can you forward my reproducer to the responsible developers? |
Thank you for the update, much appreciated! |
While we are looking into the issue, could you try a workaround: The setting will force ROCr-runtime/KFD to use polling to wait for signals rather than interrupts. |
The `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` issue has been worked around by properly setting `HCC_PATH` on the CMake side. The shutdown issue has been worked around by replacing interrupts with polling (suggested at ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what. It's not the missing `cudaDestoryStream` though. Fixes #3620 (according to `ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000`). Fixes #3587 (according to `ctest -R ek_charged_plate --repeat-until-fail 100`). **TODO** - https://github.com/espressomd/docker/blob/master/docker/rocm-python3/Dockerfile-latest needs to be updated to ROCm 3.3 once this pull request is merged.
Hi @mxz297 |
The reproducer no longer compiles with a recent rocm. I tweaked it to compile with rocm-4.3.1 and it does not seem to have the problem any more. |
@ROCmSupport I was able to reproduce this issue w/ ROCm 4.5.0 |
@ROCmSupport I observed this issue with ROCm 4.5.2 using hpctoolkit's hpcrun to sample miniqmc. The application deadlocked with a thread stuck in a callstack like that above ending in ioctl (I didn't save the callstack). @jrmadsen Do you have a good reproducer or should we try to build one? I think that any thread that calls HSA::hsa_signal_wait_scacquire is vulnerable. |
@jmellorcrummey I haven't created a reproducer yet. For the time being, I've essentially resorted to just setting
namespace
{
    int disable_hsa_interrupt_on_load = setenv("HSA_ENABLE_INTERRUPT", "0", 0);
}
It is not ideal and definitely causes the CPU utilization to increase, but at least it prevents the deadlock. I did delve a little deeper into it.
Testing
I noticed a possible issue with the implementation in ROCT-Thunk-Interface where
int kmtIoctl(int fd, unsigned long request, void *arg)
{
    int ret;
    do {
        ret = ioctl(fd, request, arg);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
    if (ret == -1 && errno == EBADF) {
        /* In case pthread_atfork didn't catch it, this will
         * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
         */
        pr_err("KFD file descriptor not valid in this process\n");
        is_forked_child();
    }
    return ret;
}
and thought maybe this was happening when the signal handler was overwriting
It appears it is the
Once I re-implemented
#include <signal.h>
#include <pthread.h>

static __thread sigset_t _signal_set;

static void setup_signal_set(void)
{
    static __thread size_t _once = 0;
    if(_once != 0) return;
    _once = 1;
    sigemptyset(&_signal_set);
    sigaddset(&_signal_set, SIGPROF);
    sigaddset(&_signal_set, SIGALRM);
    sigaddset(&_signal_set, SIGVTALRM);
}

/* Call ioctl, restarting if it is interrupted */
int kmtIoctl(int fd, unsigned long request, void *arg)
{
    int ret = 0;
    int err = 0;
    setup_signal_set();
    pthread_sigmask(SIG_BLOCK, &_signal_set, NULL);
    do
    {
        ret = ioctl(fd, request, arg);
        err = errno;
    }
    while(ret == -1 && (err == EINTR || err == EAGAIN));
    if (ret == -1 && err == EBADF) {
        /* In case pthread_atfork didn't catch it, this will
         * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
         */
        pr_err("KFD file descriptor not valid in this process\n");
        is_forked_child();
    }
    pthread_sigmask(SIG_UNBLOCK, &_signal_set, NULL);
    return ret;
}
Thus, as far as I can tell, it appears that when the signal is delivered during the
EDIT: as a sanity check, I removed the signal blocking and all my tests resumed deadlocking. |
As you can see from this snippet from one of the outputs, there were several potential cases for the signal handler to interrupt the |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| WALL CLOCK TIME (VIA SAMPLING) |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | VAR | STDDEV | % SELF |
|--------------------------------------------------------------------------------------------------------------------------|--------|--------|---------------------|--------|-----------|----------|----------|----------|----------|----------|--------|
...
| |0>>> |_hipLaunchKernel | 2337 | 1 | sampling_wall_clock | sec | 14.395847 | 0.006160 | 0.001834 | 0.207173 | 0.000041 | 0.006394 | 0.0 |
| |0>>> |_hip_impl::hipLaunchKernelGGLImpl(unsigned long, dim3 const&, dim3 const&, unsigned int, ihipStream_t*, void**) | 2282 | 2 | sampling_wall_clock | sec | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 | 0.0 |
| |0>>> |_hipModuleGetTexRef | 2282 | 3 | sampling_wall_clock | sec | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 | 0.0 |
| |0>>> |_hipTexObjectCreate | 2258 | 4 | sampling_wall_clock | sec | 13.788658 | 0.006107 | 0.001836 | 0.207173 | 0.000041 | 0.006377 | 0.0 |
| |0>>> |_hipTexObjectCreate | 2255 | 5 | sampling_wall_clock | sec | 13.774659 | 0.006108 | 0.001836 | 0.207173 | 0.000041 | 0.006381 | 0.1 |
| |0>>> |___new_sem_wait_slow.constprop.0 | 2016 | 6 | sampling_wall_clock | sec | 10.615110 | 0.005265 | 0.001845 | 0.207173 | 0.000023 | 0.004846 | 0.0 |
| |0>>> |_do_futex_wait.constprop.0 | 2015 | 7 | sampling_wall_clock | sec | 10.610925 | 0.005266 | 0.001845 | 0.207173 | 0.000023 | 0.004847 | 100.0 |
| |0>>> |_hipTexObjectCreate | 236 | 6 | sampling_wall_clock | sec | 3.147407 | 0.013336 | 0.001836 | 0.090864 | 0.000131 | 0.011427 | 0.4 |
| |0>>> |_hipTexObjectCreate | 233 | 7 | sampling_wall_clock | sec | 3.134396 | 0.013452 | 0.001836 | 0.090864 | 0.000131 | 0.011454 | 0.1 |
| |0>>> |_hipTexObjectCreate | 231 | 8 | sampling_wall_clock | sec | 3.125229 | 0.013529 | 0.001836 | 0.090864 | 0.000132 | 0.011473 | 0.0 |
| |0>>> |_hipTexObjectCreate | 229 | 9 | sampling_wall_clock | sec | 3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 | 0.0 |
| |0>>> |_hipTexObjectCreate | 229 | 10 | sampling_wall_clock | sec | 3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 | 0.0 |
| |0>>> |_rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned... | 227 | 11 | sampling_wall_clock | sec | 3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 | 0.0 |
| |0>>> |_rocr::core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, h... | 227 | 12 | sampling_wall_clock | sec | 3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 | 0.0 |
| |0>>> |_rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long,... | 227 | 13 | sampling_wall_clock | sec | 3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 | 98.9 |
| |0>>> |_hsaKmtWaitOnEvent | 3 | 14 | sampling_wall_clock | sec | 0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 | 0.0 |
| |0>>> |_hsaKmtWaitOnMultipleEvents | 3 | 15 | sampling_wall_clock | sec | 0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 | 0.0 |
| |0>>> |_kmtIoctl | 3 | 16 | sampling_wall_clock | sec | 0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 | 0.0 |
| |0>>> |_pthread_sigmask | 3 | 17 | sampling_wall_clock | sec | 0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 | 100.0 | |
@jrmadsen We agree with your diagnosis. I meant that any thread calling hsa_signal_wait_scacquire is vulnerable to a deadlock. We also believe that the deadlocks arise due to problems with a signal interrupting the ioctl that it calls. We are aware of the flag to block interrupts. We have been using this for 2.5 years when profiling HIP programs with roctracer. I ran into the problem profiling programs using OpenMP offloading through HSA, which didn’t go through the path that set the environment variable to disable HSA interrupts. We’re glad that you are looking into this. |
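One way a tool could apply the `HSA_ENABLE_INTERRUPT` setting on every code path, including OpenMP offloading, is to set it from a load-time hook. This is a minimal sketch rather than what HPCToolkit or roctracer actually do. It assumes a GCC/Clang toolchain (for the constructor attribute) and that the ROCm runtime reads the variable when it initializes, after this hook has run. The function name `default_to_hsa_polling` is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical load-time hook: with GCC/Clang, a constructor function runs when
 * the executable or shared library containing it is loaded, before main(). */
__attribute__((constructor))
static void default_to_hsa_polling(void)
{
    /* overwrite = 0: keep any value the user exported explicitly. */
    setenv("HSA_ENABLE_INTERRUPT", "0", 0);
}
```

Because the last argument to `setenv` is 0, an explicit `export HSA_ENABLE_INTERRUPT=1` in the environment still takes precedence. As noted earlier in the thread, polling increases CPU utilization.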
Why is this issue marked as closed? ROCm still hangs when interrupted by a Linux REALTIME timer. |
@jmellorcrummey -- just as an update, I have been trying to reproduce this with both the examples here, but haven't had much luck. I did find another issue in initializing roctracer that I've reported internally:
Still working on it. |
the way i saw this recently was using hpctoolkit’s hpcrun to profile miniqmc. perhaps you can reproduce it with linux perf sampling with realtime. otherwise we could write a little preloaded library that does nothing but set up realtime sampling on the main thread and wrap pthread_create to start sampling for every other thread. then preload the library and launch miniqmc. |
That's a good idea -- I had been trying to build Omnitrace and use that, but ran into fun build issues on our cluster. I'll try that next. Have you been triggering this on Crusher? I doubt it depends on the system, but thought I'd check. |
i haven’t tried this one on crusher. i was seeing it on our local system with mi50 and rocm 5.1. |
@arghdos You shouldn't have to build it. You should just be able to use this installer and then:
export HSA_ENABLE_INTERRUPT=1
export OMNITRACE_ENABLE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=500
If you haven't been seeing it with omnitrace it's bc I do |
Didn't look close enough to see you didn't override it if set in the env :) |
We encountered a hang when using Roctracer to collect real-time profiling data on both CPUs and GPUs. HPCToolkit collects real-time profiling data by repeatedly setting a real-time timer and registering a signal handler that records a sample each time the timer goes off. Our example application (the roctracer example) hangs non-deterministically in ioctl.
I created a reproducer based on the Roctracer example, which contains only the real-time timer logic without any other HPCToolkit logic.
If we compile this reproducer and run it, it will hang non-deterministically at the following stack trace:
If we change the timer interval from 2.5 milliseconds to 1 millisecond, the reproducer crashes deterministically at the following stack trace:
These stack traces include code from roctracer, and also code from other AMD toolchains such as HIP. I cannot really determine whether the root cause of the issue is in HIP or Roctracer, but since HPCToolkit is a direct user of Roctracer, I will post it here.
Reproducer.zip