-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Add exception handling support for waitall #14397
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was by design, given the documentation. What changed?
We had earlier decided this based on complication associated with adding global exception_ptr and having to reset all the exception_ptrs corresponding to vars and ops.Another, reason was the performance impact. I think we overestimated the difficulty since the decision to used shared_ptr<exception_ptr> enables us to modify the exception_ptrs associated with vars and ops by just setting it to nullptr. Also, there would be performance implication if there is an exception in the code and this can be more pronounced for a tool supposed to used only for benchmarking. But customers seems to be using it outside benchmarking and the performance impact will only happen when there is an exception thrown so it should be acceptable. |
@mxnet-label-bot add [Exception Handling, pr-work-in-progress] |
This is good stuff. waitall() is needed when we want to synchronize on multiple tensors. |
this is ready for review |
The CI stage is complete but shows pending. Any idea @marcoabreu @lebeg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/engine/threaded_engine.cc
Outdated
@@ -428,6 +447,14 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) { | |||
for (auto&& i : threaded_opr->mutable_vars) { | |||
if (threaded_opr->opr_exception && *threaded_opr->opr_exception) { | |||
i->var_exception = threaded_opr->opr_exception; | |||
// add current operator exceptions to global exceptions if not already | |||
// added | |||
auto it = std::find(global_exception_refs_.begin(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L452-L457 are used in three places. Can we make it a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review ! I have moved it into a function.
src/resource.cc
Outdated
gpu_parallel_rand_.Get(ctx.dev_id, [ctx, seed, this]() { | ||
return new ResourceParallelRandom<gpu>(ctx, gpu_native_rand_copy_, seed); | ||
})->Seed(seed); | ||
if (ctx != Context::CPU()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a bug? Shall we check ctx == Context::GPU()? There are other non-GPU context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is a bug. I have changed the check for GPU context.
src/engine/threaded_engine.cc
Outdated
std::exception_ptr tmp; | ||
if (!global_exception_refs_.empty()) { | ||
// iterate through all exception refs | ||
for (auto itr = global_exception_refs_.begin(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use range for with const reference? is much less noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
@@ -60,6 +60,9 @@ namespace engine { | |||
// Forward declarations | |||
struct ThreadedOpr; | |||
|
|||
/*! shared_ptr to exception_ptr, used for exception handling */ | |||
typedef std::shared_ptr<std::exception_ptr> ExceptionRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to wrap it in a shared_ptr? exception_ptr has already shared ptr semantics according to cppreference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the name Ref to me is confusing, why not Ptr? why add a suffix of the type at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exception_ptr cannot be dereferenced , so we cannot update the exception object it is pointing to or make it nullptr. Since this is a requirement for us we wrapped it in a shared_ptr. Used ref to make it consistent with other places in MXNet.
src/engine/threaded_engine.cc
Outdated
itr != global_exception_refs_.end(); ++itr) { | ||
const ExceptionRef& ptr = *itr; | ||
// the first exception will be saved to be rethrown later | ||
if (*ptr != nullptr && !tmp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be evaluated in bool context, so less noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
src/engine/threaded_engine.cc
Outdated
@@ -415,6 +415,25 @@ void ThreadedEngine::WaitForAll() { | |||
finished_cv_.wait(lock, [this]() { | |||
return pending_.load() == 0 || kill_.load(); | |||
}); | |||
std::exception_ptr tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall this code be wrapped in a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it is used only once so it is fine to not use a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe then a better variable name than tmp? ex_to_rethrow?
except MXNetError: | ||
caught = True | ||
assert caught, "No exception thrown" | ||
def multiple_waits(waitall=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to use "@raises"? maybe it would be easier to read.
https://nose.readthedocs.io/en/latest/testing_tools.html
At least a small comment explaining the test approach for future readers and that we expect exception to be thrown, is that the intent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added comments. Intention is to test multiple wait_to_reads and waitalls for vars in same scope.
c5f0b28
to
9e6972a
Compare
…to add_waitall_support
if (!global_exception_refs_.empty()) { | ||
// iterate through all exception refs | ||
for (const auto& global_exception_ref : global_exception_refs_) { | ||
// the first exception will be saved to be rethrown later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the order of exceptions stored in the "global_exception_refs_" ? If we are throwing the first one then is it the innermost in the stack that causes all other exceptions or the outermost ? If its outermost then it might not give correct idea about what was the root cause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@access2rohit the order of the exceptions will be maintained exception thrown first will be rethrown first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
…to add_waitall_support
@anirudh2290 Is this PR good to merge now ? |
Merged. Thanks for your contribution! |
This PR likely broke the profiler tutorial, http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/275/pipeline/ |
checking this now. |
* Relax constexpr restriction * Change imagenet_gen_qsym_mkldnn * Add exception handling support for waitall * Fix exception handling documentation * Revert constexpr change * Add comments * Fix test * Skip exception for op check names * Print exceptions thrown for CPP Package NDArray module * Reducing batch_size to make cpp-package example pass * Fix bug: apache#14426 * use ExceptionRef in threaded_engine code * add note for performance impact of waitall * Add check for GPU contxt * Use range for with const reference * Improve comments and error message for exception handling test * Change exception_ptr name in waitall * Fix bug
Description
Add exception handling support for waitall
Fixes: #13234, Fixes: #14426
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments