
Fix deadlock in ROCM EP when profiling enabled #13301

Closed
wants to merge 1 commit

Conversation

@abudup (Contributor) commented Oct 12, 2022

Description

Implement a workaround for a potential bug in the ROCm framework that causes deadlocks on StableDiffusion models when executed with profiling enabled.

Motivation and Context

ORT deadlocks when executing the StableDiffusion model on AMD GPUs with ROCm profiling enabled. I suspect this is a bug in the ROCm framework, triggered by enqueueing a callback on the device stream and then immediately synchronizing on that stream.
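For illustration, here is a minimal sketch of the pattern I believe triggers the hang; the names (`RunAndRelease`, `ReleaseBuffers`, `BufferList`) are hypothetical stand-ins for the actual EP code, and return codes are elided for brevity:

```cpp
#include <hip/hip_runtime.h>

struct BufferList;                         // hypothetical holder for device buffers
void ReleaseBuffers(BufferList* buffers);  // hypothetical release routine

// Host callback invoked by the HIP runtime once the stream reaches this point.
static void ReleaseBuffersCallback(hipStream_t /*stream*/, hipError_t /*status*/,
                                   void* user_data) {
  ReleaseBuffers(static_cast<BufferList*>(user_data));
}

void RunAndRelease(hipStream_t stream, BufferList* buffers) {
  // 1) Enqueue a host callback on the device stream ...
  hipStreamAddCallback(stream, ReleaseBuffersCallback, buffers, 0);
  // 2) ... then immediately block on the same stream. With profiling enabled,
  //    this combination is where the deadlock is observed.
  hipStreamSynchronize(stream);
}
```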

In this particular case, there is no real need to do this: we can swap the order, first synchronizing on the stream and then releasing the buffers synchronously once we know that the device stream has completed execution.
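A sketch of the reordered version, under the same hypothetical names as above:

```cpp
#include <hip/hip_runtime.h>

struct BufferList;                         // hypothetical, as above
void ReleaseBuffers(BufferList* buffers);  // hypothetical, as above

void RunAndReleaseReordered(hipStream_t stream, BufferList* buffers) {
  // Wait until the device stream has drained all enqueued work ...
  hipStreamSynchronize(stream);
  // ... then release the buffers synchronously on the host. No callback is
  // enqueued on the stream, so the problematic pattern never occurs.
  ReleaseBuffers(buffers);
}
```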

@abudup (Contributor, Author) commented Oct 12, 2022

@RandySheriff, @zhangyaobit: requesting a review; please take a look.

@yuslepukhin (Member) commented

The PR description seemingly states that there is no clarity on the potential bug, or even whether it exists. Without that clarity, this is more like shooting in the dark.
Can you sync up with AMD on this?

@cloudhan (Contributor) commented

I actually ran into this problem last month. It seems to happen only when more than one session is being profiled, and the deadlock occurs during trace dumping.

@abudup (Contributor, Author) commented Oct 17, 2022

@cloudhan: thank you for confirming this behavior!

@yuslepukhin: I understand and share your concerns about not having clarity on the root cause of the problem. I have discussed this with some AMD engineers, and they've requested a minimal example to reproduce the behavior. Unfortunately, that has proven to be rather challenging. The stack traces I've observed are very similar to the ones described in this bug. In the meantime, I thought there might be value in having a workaround for this problem in onnxruntime.

@yuslepukhin (Member) commented

> In the meantime, I thought there might be value in having a workaround for this problem in onnxruntime.

Would this workaround invalidate our knowledge of the behavior, cause it to manifest in other ways, and make us understand the problem less?

@abudup (Contributor, Author) commented Oct 17, 2022

I'm assuming that you're referring to the other bug report. I don't see how it invalidates anything or makes us understand the problem less. As for manifesting in other ways: I suppose it could.

@abudup closed this Nov 3, 2022