[MXNET-266] Fix cudnn_conv and cudnn_deconv deadlock #10392
Great job, thanks a lot for fixing this so quickly!
@KellenSunderland @marcoabreu Thanks for the initial investigation of the root cause. Even though this should be a safe fix, I am still not clear why the deadlock could happen. It seems that a thread was running the pushed async function again and again, so the worker thread had to wait until the var was released. Could you provide some insights? GDB doesn't seem to help very much here. I had to add a lot of logging messages to see the execution.
&back_algo_w_)) {
// Not in algo registry, must determine via *Get*() or *Find*()
Engine::VarHandle var = Engine::Get()->NewVariable();
Engine::Get()->PushAsync([=](RunContext rctx, Engine::CallbackOnComplete on_complete) {
mshadow::Stream<gpu> *s = rctx.get_stream<gpu>();
nit: indentation?
The indentation of this piece of code is indeed not right. I changed it in a previous commit, but that produced a huge diff that made code review much harder. I can fix the indentation either after this PR is merged or after everyone finishes reviewing the current code changes. :)
&back_algo_, &back_algo_w_)) {
// Not in algo registry, must determine via *Get*() or *Find*()
Engine::VarHandle var = Engine::Get()->NewVariable();
Engine::Get()->PushAsync([=](RunContext rctx, Engine::CallbackOnComplete on_complete) {
nit: indentation?
Please fix indentation and we can merge
Indentation fixed. Let's wait till the CI passes.
* Fix deadlock of cudnn_conv wrapper
* Fix deconv deadlock
* Fix lint
* Revert "Fix lint" This reverts commit 66f0936.
* Fix lint
* Fix indentation
Description
This PR is expected to address the deadlock issue #10341 introduced by #9677.
Fixed the issue by not pushing the async function into the engine, since this block of code is already being executed by a worker thread for a GPU context. Both cudnn_conv and cudnn_deconv are fixed. The unit test that was temporarily disabled due to the deadlock is re-enabled.
@piiswrong @eric-haibin-lin @zheng-da
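For readers unfamiliar with this class of problem, here is a minimal, generic sketch of why pushing work to an engine from one of its own worker threads and then waiting on it can stall. It is not MXNet code and is not a confirmed account of the root cause in #10341: `ToyEngine`, `NestedPushCompletes`, and the single-worker setup are all hypothetical, and the bounded `wait_for` stands in for what would be an indefinite wait in a real deadlock.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical one-worker task queue. This is NOT MXNet's Engine API.
class ToyEngine {
 public:
  ToyEngine() : worker_([this] { Run(); }) {}
  ~ToyEngine() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();  // drains any queued tasks before returning
  }
  void Push(std::function<void()> fn) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(fn));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> fn;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to run
        fn = std::move(queue_.front());
        queue_.pop();
      }
      fn();  // run the task outside the lock
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
  bool stop_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Returns true if the nested work finished while the outer task waited for it.
bool NestedPushCompletes(bool push_to_engine) {
  std::promise<bool> result;
  std::future<bool> result_future = result.get_future();
  ToyEngine engine;  // declared after `result` so it is destroyed (joined) first
  engine.Push([&] {
    auto inner_done = std::make_shared<std::promise<void>>();
    std::future<void> inner_future = inner_done->get_future();
    auto select_algo = [inner_done] { inner_done->set_value(); };
    if (push_to_engine) {
      engine.Push(select_algo);  // queued behind the task that waits on it
    } else {
      select_algo();  // run inline on the current worker thread
    }
    // Bounded wait so the sketch terminates; a real wait would block forever.
    bool done = inner_future.wait_for(std::chrono::milliseconds(200)) ==
                std::future_status::ready;
    result.set_value(done);
  });
  return result_future.get();
}
```

Calling `NestedPushCompletes(true)` models the nested push: the only worker is busy waiting, so the nested task cannot start until the wait gives up. Calling `NestedPushCompletes(false)` models the approach taken in this PR, where the work runs directly on the thread that is already executing.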