
Fix deadlock in ROCM EP when profiling enabled #13301

Closed
wants to merge 1 commit

Conversation

@abudup (Contributor) commented Oct 12, 2022

Description

Implement a workaround for a potential bug in the ROCm framework that causes deadlocks on StableDiffusion models when executed with profiling enabled.

Motivation and Context

ORT deadlocks when executing the StableDiffusion model on AMD GPUs with ROCm profiling enabled. I suspect this is a bug in the ROCm framework, triggered by enqueueing a callback on the device stream and then immediately synchronizing on that stream.
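For illustration, here is a minimal sketch of the pattern I believe triggers the hang; the names (`RunAndRelease`, `ReleaseBuffers`, `BufferList`) are hypothetical stand-ins for the actual EP code, and return codes are elided for brevity:

```cpp
#include <hip/hip_runtime.h>

struct BufferList;                         // hypothetical holder for device buffers
void ReleaseBuffers(BufferList* buffers);  // hypothetical release routine

// Host callback invoked by the HIP runtime once the stream reaches this point.
static void ReleaseBuffersCallback(hipStream_t /*stream*/, hipError_t /*status*/,
                                   void* user_data) {
  ReleaseBuffers(static_cast<BufferList*>(user_data));
}

void RunAndRelease(hipStream_t stream, BufferList* buffers) {
  // 1) Enqueue a host callback on the device stream ...
  hipStreamAddCallback(stream, ReleaseBuffersCallback, buffers, 0);
  // 2) ... then immediately block on the same stream. With profiling enabled,
  //    this combination is where the deadlock is observed.
  hipStreamSynchronize(stream);
}
```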

In this particular case, there is no real need to do this: we can swap the order, first synchronizing on the stream and then releasing the buffers synchronously once we know that the device stream has completed execution.
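A sketch of the reordered version, under the same hypothetical names as above:

```cpp
#include <hip/hip_runtime.h>

struct BufferList;                         // hypothetical, as above
void ReleaseBuffers(BufferList* buffers);  // hypothetical, as above

void RunAndReleaseReordered(hipStream_t stream, BufferList* buffers) {
  // Wait until the device stream has drained all enqueued work ...
  hipStreamSynchronize(stream);
  // ... then release the buffers synchronously on the host. No callback is
  // enqueued on the stream, so the problematic pattern never occurs.
  ReleaseBuffers(buffers);
}
```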

@abudup (Contributor, Author) commented Oct 12, 2022

@RandySheriff, @zhangyaobit: requesting a review; please take a look.

@yuslepukhin (Member) commented

The PR description seemingly states that there is no clarity on the potential bug, or even whether it exists. Without that clarity, this is more like shooting in the dark.
Can you sync up with AMD on this?

@cloudhan (Contributor) commented

I actually ran into this problem last month. It seems to happen only when more than one session is being profiled, and the deadlock occurs during trace dumping.

@abudup (Contributor, Author) commented Oct 17, 2022

@cloudhan: thank you for confirming this behavior!

@yuslepukhin: I understand and share your concerns about not having clarity on the root cause of the problem. I have discussed this with some AMD engineers, and they've requested a minimal example to reproduce the behavior. Unfortunately, that has proven to be rather challenging. The stack traces I've observed are very similar to the ones described in this bug. In the meantime, I thought there might be value in having a workaround for this problem in onnxruntime.

@yuslepukhin (Member) commented

> In the meantime, I thought there might be value in having a workaround for this problem in onnxruntime.

Would this workaround invalidate our knowledge of the behavior, cause it to manifest in other ways, and make us understand the problem less?

@abudup (Contributor, Author) commented Oct 17, 2022

I'm assuming that you're referring to the other bug report. I don't see how it invalidates anything or makes us understand the problem less. As for manifesting in other ways: I suppose it could.

@abudup closed this Nov 3, 2022