Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor code away from using low-level termination primitives, to higher-level routines #824

Merged
merged 10 commits into from
Jun 10, 2020

Conversation

PhilMiller
Copy link
Member

@PhilMiller PhilMiller commented May 26, 2020

No description provided.

@PhilMiller PhilMiller requested a review from lifflander May 26, 2020 15:17
@PhilMiller
Copy link
Member Author

PhilMiller commented May 26, 2020

This doesn't actually address #649 at all yet. I just wanted to get rid of dumb addAction calls to have fewer sites to update

Somehow, this reveals some test failures, both timeouts and inconsistent direct failures.

  • lb_iter seems to be ok
  • test_term_chaining seems to be ok
  • The dep_send_chain failures seems to have been fixed by adding the apparently missing finishedEpoch call
  • migrate_collection sometimes passes, and sometimes hangs
  • vt:TestTermCleanup.test_termination_cleanup_2_ sometimes passes, and sometimes hangs with a failure message printed

I don't really see how this should have messed with any test behaviors at all. Is there weirdness in where one can test isEpochTerminated that I'm not respecting?

@PhilMiller
Copy link
Member Author

Ah, I see // Might return (conservatively) false if the epoch is non-local so I'll need to adjust things at least some.

@PhilMiller
Copy link
Member Author

In testEpochTerminated, it looks to me like a remote rooted epoch can never return Terminated - it makes the request to the root, and notes it, but never checks whether a response came back in any way. The response handler epochTerminated() triggers associated actions, but the testing methods don't refer to any state that it updates

@PhilMiller
Copy link
Member Author

I'm testing a fix that checks against windows with recorded terminated epochs

@PhilMiller
Copy link
Member Author

This doesn't seem to help, which surprises me:

 TermStatusEnum TerminationDetector::testEpochTerminated(EpochType epoch) {
   TermStatusEnum status = TermStatusEnum::Pending;
   auto const& is_rooted_epoch = epoch::EpochManip::isRooted(epoch);
 
-  if (is_rooted_epoch) {
+  if (getWindow(epoch)->isTerminated(epoch)) {
+    status = TermStatusEnum::Terminated;
+  } else if (is_rooted_epoch) {
     auto const& this_node = theContext()->getNode();
     auto const& root = epoch::EpochManip::node(epoch);
     if (root == this_node) {

@codecov
Copy link

codecov bot commented May 26, 2020

Codecov Report

Merging #824 into develop will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #824      +/-   ##
===========================================
- Coverage    80.49%   80.47%   -0.03%     
===========================================
  Files          353      352       -1     
  Lines        11207    11173      -34     
===========================================
- Hits          9021     8991      -30     
+ Misses        2186     2182       -4     
Impacted Files Coverage Δ
src/vt/rdmahandle/sub_handle.impl.h 0.00% <ø> (ø)
src/vt/scheduler/scheduler.h 100.00% <ø> (ø)
src/vt/termination/termination.h 100.00% <ø> (ø)
src/vt/vrt/collection/types/migratable.h 50.00% <ø> (ø)
src/vt/vrt/collection/balance/elm_stats.impl.h 92.68% <100.00%> (+0.18%) ⬆️
tests/unit/termination/test_term_chaining.cc 98.46% <100.00%> (-0.14%) ⬇️
tests/unit/termination/test_term_cleanup.cc 96.92% <100.00%> (-0.27%) ⬇️
tests/unit/termination/test_term_dep_send_chain.cc 100.00% <100.00%> (ø)

@PhilMiller
Copy link
Member Author

There's still going to be plenty to do in the way of fix-ups:

> git grep addAction | wc -l
115
> git grep runScheduler | wc -l
84

@PhilMiller
Copy link
Member Author

An observation that very little inside src/vt uses TerminationDetector::addAction, while there are lots of calls in the examples and tests

src/vt/messaging/dependent_send_chain.h:    theTerm()->addActionUnique(last_epoch_, PendingClosure(std::move(link)));
src/vt/messaging/dependent_send_chain.h:    // having an epoch to call addAction on, rather than edge cases of
src/vt/vrt/collection/balance/baselb/baselb.cc:  theTerm()->addAction(migration_epoch_, [this]{ this->migrationDone(); });
src/vt/vrt/collection/manager.impl.h:    theTerm()->addAction(insert_epoch, finished_insert_trigger);
git grep addAction  examples/ tests/ |wc -l
81

@PhilMiller
Copy link
Member Author

An observation that very little inside src/vt uses TerminationDetector::addAction, while there are lots of calls in the examples and tests

src/vt/messaging/dependent_send_chain.h:    theTerm()->addActionUnique(last_epoch_, PendingClosure(std::move(link)));
src/vt/messaging/dependent_send_chain.h:    // having an epoch to call addAction on, rather than edge cases of
src/vt/vrt/collection/balance/baselb/baselb.cc:  theTerm()->addAction(migration_epoch_, [this]{ this->migrationDone(); });
src/vt/vrt/collection/manager.impl.h:    theTerm()->addAction(insert_epoch, finished_insert_trigger);

The first three of these definitely belong in the global epoch - they are completely closed WRT any other epoch that might be floating around. I'm less clear on the last one.

@PhilMiller
Copy link
Member Author

One of the examples, reduce_integral, was happier without an addAction

The other two

> git grep addAction  examples/
examples/termination/termination_collective.cc:  vt::theTerm()->addAction(epoch, [=]{
examples/termination/termination_rooted.cc:    vt::theTerm()->addAction(epoch, [=]{

are both well suited to global epoch

@PhilMiller
Copy link
Member Author

examples/termination/termination_collective.cc
examples/termination/termination_rooted.cc

Are probably not first-level usage examples we want to highlight so much, but their addAction calls could/should run in the global epoch.

runSchedulerThrough(ep);
}

void runInEpochCollective(ActionType&& fn) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking (for the future) that maybe runInEpochRooted and runInEpochCollective should return a holder that executes until the end it goes out of scope. So it can be chained with other operations if the user wants. But this is good for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I exactly get what you're thinking of. Do you mean a holder that doesn't execute until it goes out of scope (or is manually released)? What sort of chaining do you have in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see the potential for a modest generalization of DependentSendChain, into something like DependentActionChain, on which you could then wait on the whole chain (effectively, the last epoch in the chain)

@lifflander
Copy link
Collaborator

Convert out of draft mode?

@PhilMiller
Copy link
Member Author

PhilMiller commented Jun 2, 2020 via email

@PhilMiller PhilMiller changed the title 649 action epoch Refactor code away from using low-level termination primitives, to higher-level routines Jun 2, 2020
@PhilMiller PhilMiller marked this pull request as ready for review June 2, 2020 14:10
@PhilMiller
Copy link
Member Author

There had been a failure in vt:TestLB.test_lb_1_proc_2 on both Travis builds before I merged with develop. After the merge, it doesn't reproduce in 100 runs. I don't see anything in the commit logs between those two that would have affected this test, though. Am I missing something? Should we just go ahead with the merge?

@lifflander
Copy link
Collaborator

There had been a failure in vt:TestLB.test_lb_1_proc_2 on both Travis builds before I merged with develop. After the merge, it doesn't reproduce in 100 runs. I don't see anything in the commit logs between those two that would have affected this test, though. Am I missing something? Should we just go ahead with the merge?

That's surprising to me. I highly doubt the failure is related to this PR. I say we move ahead with this PR. Did you look at the output from the address sanitizer on the Github action run to ensure nothing weird is happening?

@lifflander lifflander changed the title Refactor code away from using low-level termination primitives, to higher-level routines 649 Refactor code away from using low-level termination primitives, to higher-level routines Jun 9, 2020
@lifflander lifflander changed the title 649 Refactor code away from using low-level termination primitives, to higher-level routines Refactor code away from using low-level termination primitives, to higher-level routines Jun 9, 2020
@lifflander
Copy link
Collaborator

So this has had the same spurious failure of test_lb_1_proc_2 as #826

We need to get to the bottom of this error.

Copy link
Collaborator

Codacy Here is an overview of what got changed by this pull request:

Clones removed
==============
+ examples/collection/lb_iter.cc  -2
+ examples/collection/migrate_collection.cc  -4
         

See the complete overview on Codacy

@lifflander lifflander merged commit 26a049d into develop Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants