This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add exception handling support for waitall #14397

Merged
merged 25 commits into from
Apr 8, 2019

anirudh2290
Member

@anirudh2290 anirudh2290 commented Mar 12, 2019

Description

Add exception handling support for waitall
Fixes: #13234, Fixes: #14426

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • added exception handling support for waitall
  • added tests
  • modified documentation

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@anirudh2290 anirudh2290 requested a review from szha as a code owner March 12, 2019 02:12
Member

@szha szha left a comment

I thought it was by design, given the documentation. What changed?

@anirudh2290
Member Author

We had earlier decided this based on the complication associated with adding a global exception_ptr and having to reset all the exception_ptrs corresponding to vars and ops. Another reason was the performance impact. I think we overestimated the difficulty, since the decision to use shared_ptr<exception_ptr> enables us to modify the exception_ptrs associated with vars and ops by just setting them to nullptr. Also, there would be a performance implication if there is an exception in the code, and this can be more pronounced for a tool supposed to be used only for benchmarking. But customers seem to be using it outside benchmarking, and the performance impact will only occur when an exception is thrown, so it should be acceptable.

@karan6181
Contributor

@mxnet-label-bot add [Exception Handling, pr-work-in-progress]

@marcoabreu marcoabreu added Exception Handling pr-work-in-progress PR is still work in progress labels Mar 12, 2019
@apeforest
Contributor

This is good stuff. waitall() is needed when we want to synchronize on multiple tensors.

Review thread on src/engine/threaded_engine.h (resolved)
Review thread on src/engine/threaded_engine.h (outdated, resolved)
Review thread on docs/architecture/exception_handling.md (resolved)
@anirudh2290 anirudh2290 changed the title [WIP] Add exception handling support for waitall Add exception handling support for waitall Mar 14, 2019
@anirudh2290
Member Author

this is ready for review

@anirudh2290
Member Author

The CI stage is complete but shows pending. Any idea, @marcoabreu @lebeg?

Member

@yuxihu yuxihu left a comment

Nice work. Some minor comments.

@larroy Please also take a look at this PR. You raised questions about using shared_ptr around exception_ptr. Anirudh had an answer here.

@@ -428,6 +447,14 @@ inline void ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) {
for (auto&& i : threaded_opr->mutable_vars) {
if (threaded_opr->opr_exception && *threaded_opr->opr_exception) {
i->var_exception = threaded_opr->opr_exception;
// add current operator exceptions to global exceptions if not already
// added
auto it = std::find(global_exception_refs_.begin(),
Member

L452-L457 are used in three places. Can we make it a function?

Member Author

Thanks for the review! I have moved it into a function.
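
For readers following the thread, the refactoring discussed above (look up the operator's exception in the global list with std::find and append it only if it is missing) could be factored out roughly as follows. This is a minimal standalone sketch, not the code merged in this PR: the helper name AddToGlobalExceptions, the use of std::vector as the container, and the demo function are assumptions made for illustration. Only the ExceptionRef alias and the std::find lookup appear in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <exception>
#include <memory>
#include <stdexcept>
#include <vector>

// Alias taken from the diff: a shared slot that holds an exception_ptr.
typedef std::shared_ptr<std::exception_ptr> ExceptionRef;

// Hypothetical helper: record `ref` in `global_refs` unless the same shared
// slot is already present. shared_ptr equality compares the stored pointers,
// so two handles to the same slot count as duplicates.
inline void AddToGlobalExceptions(std::vector<ExceptionRef>* global_refs,
                                  const ExceptionRef& ref) {
  auto it = std::find(global_refs->begin(), global_refs->end(), ref);
  if (it == global_refs->end()) {
    global_refs->push_back(ref);
  }
}

// Illustrative driver: the same failing slot is reported from three call
// sites, mirroring the "used in three places" remark in the review.
inline std::size_t DemoDeduplication() {
  std::vector<ExceptionRef> global_refs;
  ExceptionRef failed = std::make_shared<std::exception_ptr>(
      std::make_exception_ptr(std::runtime_error("op failed")));
  AddToGlobalExceptions(&global_refs, failed);
  AddToGlobalExceptions(&global_refs, failed);
  AddToGlobalExceptions(&global_refs, failed);
  return global_refs.size();
}
```

Calling DemoDeduplication() returns 1 in this sketch, because all three calls pass handles to the same slot.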

src/resource.cc Outdated
gpu_parallel_rand_.Get(ctx.dev_id, [ctx, seed, this]() {
return new ResourceParallelRandom<gpu>(ctx, gpu_native_rand_copy_, seed);
})->Seed(seed);
if (ctx != Context::CPU()) {
Member

Was this a bug? Shall we check ctx == Context::GPU()? There are other non-GPU contexts.

Member Author

Yes, this is a bug. I have changed the check to test for a GPU context.
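
The difference between the two checks can be illustrated with a small standalone sketch. The DeviceKind enum and both function names below are hypothetical simplifications, not MXNet's Context API. The point is only that a "not CPU" test also admits other non-GPU contexts (pinned CPU memory is used here as an example), while an explicit GPU test does not.

```cpp
#include <cassert>

// Hypothetical, simplified device kinds; the real Context type carries more
// information (for example a device id).
enum class DeviceKind { kCPU, kGPU, kCPUPinned };

// Negated form, as in the diff line under review: anything that is not a
// plain CPU context is treated as if it were a GPU.
inline bool TakesGpuPathByNegation(DeviceKind kind) {
  return kind != DeviceKind::kCPU;
}

// Form suggested in the review: only a GPU context takes the GPU path.
inline bool TakesGpuPathByEquality(DeviceKind kind) {
  return kind == DeviceKind::kGPU;
}
```

With these definitions, a pinned-CPU context takes the GPU path under the negated check but not under the equality check, which is the mismatch the reviewer points out.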

@anirudh2290 anirudh2290 added pr-awaiting-review PR is waiting for code review pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-work-in-progress PR is still work in progress labels Mar 21, 2019
std::exception_ptr tmp;
if (!global_exception_refs_.empty()) {
// iterate through all exception refs
for (auto itr = global_exception_refs_.begin();
Contributor

Shall we use a range-based for loop with a const reference? It is much less noisy.

Member Author

changed

Member Author

Thanks for the review!

@@ -60,6 +60,9 @@ namespace engine {
// Forward declarations
struct ThreadedOpr;

/*! shared_ptr to exception_ptr, used for exception handling */
typedef std::shared_ptr<std::exception_ptr> ExceptionRef;
Contributor

Do we need to wrap it in a shared_ptr? exception_ptr already has shared-pointer semantics according to cppreference.

Contributor

Also, the name Ref is confusing to me. Why not Ptr? Why add a type suffix at all?

Member Author

exception_ptr cannot be dereferenced, so we cannot update the exception object it is pointing to or make it nullptr. Since this is a requirement for us, we wrapped it in a shared_ptr. We used Ref to make it consistent with other places in MXNet.
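
The reasoning in this exchange may be easier to follow with a concrete case. The sketch below is a standalone illustration, not engine code: OprLike, VarLike, HasException, and DemoSharedSlot are hypothetical names, and only the ExceptionRef alias comes from the diff. It shows that clearing through one shared_ptr handle is visible through every other handle, whereas resetting one copy of a plain exception_ptr leaves the other copies untouched.

```cpp
#include <cassert>
#include <exception>
#include <memory>
#include <stdexcept>

// Alias taken from the diff: a shared, resettable slot for an exception_ptr.
typedef std::shared_ptr<std::exception_ptr> ExceptionRef;

// Hypothetical stand-ins for an engine operator and an engine variable.
struct OprLike { ExceptionRef opr_exception; };
struct VarLike { ExceptionRef var_exception; };

// True if the slot exists and currently holds an exception.
inline bool HasException(const ExceptionRef& ref) {
  return ref && *ref;
}

inline void DemoSharedSlot() {
  OprLike op;
  op.opr_exception = std::make_shared<std::exception_ptr>(
      std::make_exception_ptr(std::runtime_error("op failed")));

  // The variable shares the operator's slot instead of copying the
  // exception_ptr stored inside it.
  VarLike var;
  var.var_exception = op.opr_exception;
  assert(HasException(op.opr_exception));
  assert(HasException(var.var_exception));

  // Clearing through one handle clears the slot for every holder.
  *var.var_exception = nullptr;
  assert(!HasException(op.opr_exception));

  // By contrast, copies of a plain exception_ptr are independent handles:
  // resetting one does not reset the other.
  std::exception_ptr first = std::make_exception_ptr(std::runtime_error("x"));
  std::exception_ptr second = first;
  first = nullptr;
  assert(second != nullptr);
}
```

The extra level of indirection is what lets a single assignment reset the state seen by all vars and ops that refer to the same failure.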

itr != global_exception_refs_.end(); ++itr) {
const ExceptionRef& ptr = *itr;
// the first exception will be saved to be rethrown later
if (*ptr != nullptr && !tmp) {
Contributor

can be evaluated in bool context, so less noise.

Member Author

changed

@@ -415,6 +415,25 @@ void ThreadedEngine::WaitForAll() {
finished_cv_.wait(lock, [this]() {
return pending_.load() == 0 || kill_.load();
});
std::exception_ptr tmp;
Contributor

Shall this code be wrapped in a function?

Member Author

Currently it is used only once so it is fine to not use a function.

Contributor

Maybe then a better variable name than tmp? ex_to_rethrow?

except MXNetError:
caught = True
assert caught, "No exception thrown"
def multiple_waits(waitall=False):
Contributor

does it make sense to use "@raises"? maybe it would be easier to read.

https://nose.readthedocs.io/en/latest/testing_tools.html

At least add a small comment explaining the test approach for future readers, and that we expect an exception to be thrown. Is that the intent?

Member Author

I have added comments. Intention is to test multiple wait_to_reads and waitalls for vars in same scope.

if (!global_exception_refs_.empty()) {
// iterate through all exception refs
for (const auto& global_exception_ref : global_exception_refs_) {
// the first exception will be saved to be rethrown later
Contributor

What is the order of exceptions stored in "global_exception_refs_"? If we are throwing the first one, is it the innermost in the stack, the one that causes all the other exceptions, or the outermost? If it's the outermost, it might not give a correct idea of the root cause.

Member Author

@access2rohit the order of the exceptions will be maintained: the exception thrown first will be rethrown first.
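
The "first thrown, first rethrown" behaviour described here can be sketched as a standalone function. This is an illustration under stated assumptions, not the merged WaitForAll code: the free function RethrowFirstAndClear, the std::vector container, and the choice to clear every slot before rethrowing are assumptions. The range-based for loop, the bool-context checks, and the name ex_to_rethrow follow suggestions made in this review.

```cpp
#include <cassert>
#include <exception>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Alias taken from the diff.
typedef std::shared_ptr<std::exception_ptr> ExceptionRef;

// Hypothetical free-function version of the rethrow step: remember the first
// recorded exception, clear every slot, then rethrow the remembered one.
inline void RethrowFirstAndClear(std::vector<ExceptionRef>* global_refs) {
  std::exception_ptr ex_to_rethrow;
  for (const auto& ref : *global_refs) {
    // Slots are visited in insertion order, so the earliest one wins.
    if (*ref && !ex_to_rethrow) {
      ex_to_rethrow = *ref;
    }
    // Assumption: later waits on the same slots should not throw again.
    *ref = nullptr;
  }
  global_refs->clear();
  if (ex_to_rethrow) {
    std::rethrow_exception(ex_to_rethrow);
  }
}

// Illustrative driver: two failures are recorded and only the first surfaces.
inline std::string DemoFirstWins() {
  std::vector<ExceptionRef> refs;
  refs.push_back(std::make_shared<std::exception_ptr>(
      std::make_exception_ptr(std::runtime_error("first"))));
  refs.push_back(std::make_shared<std::exception_ptr>(
      std::make_exception_ptr(std::runtime_error("second"))));
  try {
    RethrowFirstAndClear(&refs);
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "no exception";
}
```

In this sketch DemoFirstWins() returns "first". A second call to RethrowFirstAndClear on the same, now empty, list returns without throwing.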

Contributor

@lebeg lebeg left a comment

nice

@piyushghai
Contributor

@anirudh2290 Is this PR good to merge now?

@wkcn wkcn merged commit 3781816 into apache:master Apr 8, 2019
@wkcn
Member

wkcn commented Apr 8, 2019

Merged. Thanks for your contribution!

@anirudh2290
Member Author

checking this now.

@anirudh2290 anirudh2290 mentioned this pull request Apr 11, 2019
7 tasks
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Relax constexpr restriction

* Change imagenet_gen_qsym_mkldnn

* Add exception handling support for waitall

* Fix exception handling documentation

* Revert constexpr change

* Add comments

* Fix test

* Skip exception for op check names

* Print exceptions thrown for CPP Package NDArray module

* Reducing batch_size to make cpp-package example pass

* Fix bug: apache#14426

* use ExceptionRef in threaded_engine code

* add note for performance impact of waitall

* Add check for GPU contxt

* Use range for with const reference

* Improve comments and error message for exception handling test

* Change exception_ptr name in waitall

* Fix bug
Labels
Exception Handling pr-awaiting-merge Review and CI is complete. Ready to Merge pr-awaiting-review PR is waiting for code review
Development

Successfully merging this pull request may close these issues.

mx.random.seed with ctx failures on a gpu build when run with cpu context
waitforall() hides errors