Refactor code away from using low-level termination primitives, to higher-level routines #824

PhilMiller · 2020-05-26T15:17:31Z

No description provided.

src/vt/vrt/collection/balance/greedylb/greedylb.cc

PhilMiller · 2020-05-26T15:27:07Z

This doesn't actually address #649 at all yet. I just wanted to get rid of dumb addAction calls to have fewer sites to update

Somehow, this reveals some test failures, both timeouts and inconsistent direct failures.

lb_iter seems to be ok
test_term_chaining seems to be ok
The dep_send_chain failures seem to have been fixed by adding the apparently missing finishedEpoch call
migrate_collection sometimes passes, and sometimes hangs
vt:TestTermCleanup.test_termination_cleanup_2_ sometimes passes, and sometimes hangs with a failure message printed

I don't really see how this should have messed with any test behaviors at all. Is there weirdness in where one can test isEpochTerminated that I'm not respecting?

PhilMiller · 2020-05-26T15:29:43Z

Ah, I see the comment `// Might return (conservatively) false if the epoch is non-local`, so I'll need to adjust things at least somewhat.

PhilMiller · 2020-05-26T15:38:01Z

In testEpochTerminated, it looks to me like a remote rooted epoch can never return Terminated - it makes the request to the root, and notes it, but never checks whether a response came back in any way. The response handler epochTerminated() triggers associated actions, but the testing methods don't refer to any state that it updates
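The problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyDetector`, `TermStatus`, `onTerminatedReply`, and `requestOutstanding` are hypothetical names, not vt's types or API, and the "request to the root" is reduced to bookkeeping in a set. The point it shows is that a poll-style test can only ever report termination for a remote epoch if the reply handler records state that the poll later reads.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Toy stand-ins for illustration only; these are NOT the vt types.
using EpochType = std::uint64_t;
enum class TermStatus { Pending, Terminated };

class ToyDetector {
public:
  // Poll-style query, analogous in spirit to testEpochTerminated().
  TermStatus test(EpochType epoch) {
    // Consult locally recorded terminations first, so that a reply which
    // has already arrived is visible to anyone polling.
    if (terminated_.count(epoch) != 0) {
      return TermStatus::Terminated;
    }
    // Otherwise note that the root has to be asked, and answer
    // conservatively for now.
    requested_.insert(epoch);
    return TermStatus::Pending;
  }

  // Reply handler, analogous in spirit to epochTerminated().
  void onTerminatedReply(EpochType epoch) {
    requested_.erase(epoch);
    terminated_.insert(epoch);  // the state that test() reads
  }

  bool requestOutstanding(EpochType epoch) const {
    return requested_.count(epoch) != 0;
  }

private:
  std::unordered_set<EpochType> requested_;
  std::unordered_set<EpochType> terminated_;
};

// First poll is pending, the reply arrives, the second poll succeeds.
inline bool remoteEpochEventuallyTerminates() {
  ToyDetector det;
  EpochType const epoch = 42;
  bool const pending_first = det.test(epoch) == TermStatus::Pending;
  bool const asked = det.requestOutstanding(epoch);
  det.onTerminatedReply(epoch);
  bool const done_after = det.test(epoch) == TermStatus::Terminated;
  return pending_first && asked && done_after;
}
```

If `onTerminatedReply` only triggered callbacks and did not insert into `terminated_`, every later call to `test` would keep returning `Pending`, which is the behavior the comment above suspects.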

PhilMiller · 2020-05-26T15:48:09Z

I'm testing a fix that checks against windows with recorded terminated epochs

PhilMiller · 2020-05-26T15:51:57Z

This doesn't seem to help, which surprises me:

 TermStatusEnum TerminationDetector::testEpochTerminated(EpochType epoch) {
   TermStatusEnum status = TermStatusEnum::Pending;
   auto const& is_rooted_epoch = epoch::EpochManip::isRooted(epoch);
 
-  if (is_rooted_epoch) {
+  if (getWindow(epoch)->isTerminated(epoch)) {
+    status = TermStatusEnum::Terminated;
+  } else if (is_rooted_epoch) {
     auto const& this_node = theContext()->getNode();
     auto const& root = epoch::EpochManip::node(epoch);
     if (root == this_node) {

codecov · 2020-05-26T16:02:20Z

Codecov Report

Merging #824 into develop will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop     #824      +/-   ##
===========================================
- Coverage    80.49%   80.47%   -0.03%     
===========================================
  Files          353      352       -1     
  Lines        11207    11173      -34     
===========================================
- Hits          9021     8991      -30     
+ Misses        2186     2182       -4

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/vt/rdmahandle/sub_handle.impl.h | `0.00% <ø> (ø)` | |
| src/vt/scheduler/scheduler.h | `100.00% <ø> (ø)` | |
| src/vt/termination/termination.h | `100.00% <ø> (ø)` | |
| src/vt/vrt/collection/types/migratable.h | `50.00% <ø> (ø)` | |
| src/vt/vrt/collection/balance/elm_stats.impl.h | `92.68% <100.00%> (+0.18%)` | ⬆️ |
| tests/unit/termination/test_term_chaining.cc | `98.46% <100.00%> (-0.14%)` | ⬇️ |
| tests/unit/termination/test_term_cleanup.cc | `96.92% <100.00%> (-0.27%)` | ⬇️ |
| tests/unit/termination/test_term_dep_send_chain.cc | `100.00% <100.00%> (ø)` | |

src/vt/vrt/collection/balance/greedylb/greedylb.cc

PhilMiller · 2020-05-26T21:26:30Z

There's still going to be plenty to do in the way of fix-ups:

> git grep addAction | wc -l
115
> git grep runScheduler | wc -l
84

PhilMiller · 2020-05-26T22:53:39Z

An observation that very little inside src/vt uses TerminationDetector::addAction, while there are lots of calls in the examples and tests

src/vt/messaging/dependent_send_chain.h:    theTerm()->addActionUnique(last_epoch_, PendingClosure(std::move(link)));
src/vt/messaging/dependent_send_chain.h:    // having an epoch to call addAction on, rather than edge cases of
src/vt/vrt/collection/balance/baselb/baselb.cc:  theTerm()->addAction(migration_epoch_, [this]{ this->migrationDone(); });
src/vt/vrt/collection/manager.impl.h:    theTerm()->addAction(insert_epoch, finished_insert_trigger);

git grep addAction  examples/ tests/ |wc -l
81

PhilMiller · 2020-05-26T23:22:14Z

An observation that very little inside src/vt uses TerminationDetector::addAction, while there are lots of calls in the examples and tests

src/vt/messaging/dependent_send_chain.h:    theTerm()->addActionUnique(last_epoch_, PendingClosure(std::move(link)));
src/vt/messaging/dependent_send_chain.h:    // having an epoch to call addAction on, rather than edge cases of
src/vt/vrt/collection/balance/baselb/baselb.cc:  theTerm()->addAction(migration_epoch_, [this]{ this->migrationDone(); });
src/vt/vrt/collection/manager.impl.h:    theTerm()->addAction(insert_epoch, finished_insert_trigger);

The first three of these definitely belong in the global epoch - they are completely closed WRT any other epoch that might be floating around. I'm less clear on the last one.

PhilMiller · 2020-05-27T01:40:59Z

One of the examples, reduce_integral, was happier without an addAction

The other two

> git grep addAction  examples/
examples/termination/termination_collective.cc:  vt::theTerm()->addAction(epoch, [=]{
examples/termination/termination_rooted.cc:    vt::theTerm()->addAction(epoch, [=]{

are both well suited to global epoch

PhilMiller · 2020-06-01T16:29:49Z

examples/termination/termination_collective.cc
examples/termination/termination_rooted.cc

Are probably not first-level usage examples we want to highlight so much, but their addAction calls could/should run in the global epoch.

lifflander · 2020-06-02T03:39:37Z

src/vt/scheduler/scheduler.cc

+  runSchedulerThrough(ep);
+}
+
+void runInEpochCollective(ActionType&& fn) {


I'm thinking (for the future) that maybe runInEpochRooted and runInEpochCollective should return a holder that executes until the end it goes out of scope. So it can be chained with other operations if the user wants. But this is good for now.

PhilMiller · 2020-06-02

I don't think I exactly get what you're thinking of. Do you mean a holder that doesn't execute until it goes out of scope (or is manually released)? What sort of chaining do you have in mind?

I could see the potential for a modest generalization of DependentSendChain, into something like DependentActionChain, on which you could then wait on the whole chain (effectively, the last epoch in the chain)
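One reading of the holder idea discussed in this thread (the action is deferred until the holder is released or goes out of scope) can be sketched as follows. Everything here is an assumption made for illustration: `ToyScheduler` and `DeferredRun` are hypothetical, there is no real epoch, and "termination" is approximated by the toy work queue draining. It is not how vt implements `runInEpochRooted` or `runInEpochCollective`.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy single-process scheduler: a queue of ready work units.
class ToyScheduler {
public:
  void enqueue(std::function<void()> work) {
    queue_.push_back(std::move(work));
  }
  bool empty() const { return queue_.empty(); }
  void runOne() {
    auto work = std::move(queue_.front());
    queue_.pop_front();
    work();  // may enqueue follow-on work
  }

private:
  std::deque<std::function<void()>> queue_;
};

// Hypothetical holder: the action is deferred until release() is called
// or the holder is destroyed, then everything it spawns runs to completion.
// Note: the action must not throw, since it may run from the destructor.
class DeferredRun {
public:
  DeferredRun(ToyScheduler& sched, std::function<void()> fn)
    : sched_(&sched), fn_(std::move(fn)) {}
  DeferredRun(DeferredRun const&) = delete;
  DeferredRun& operator=(DeferredRun const&) = delete;
  DeferredRun(DeferredRun&& other) noexcept
    : sched_(other.sched_), fn_(std::move(other.fn_)) {
    other.fn_ = nullptr;  // the moved-from holder must not run the action
  }
  ~DeferredRun() { release(); }

  void release() {
    if (!fn_) return;  // already released or moved from
    auto fn = std::move(fn_);
    fn_ = nullptr;
    sched_->enqueue(std::move(fn));
    // Stand-in for "run the scheduler through the epoch": drain the queue.
    while (!sched_->empty()) sched_->runOne();
  }

private:
  ToyScheduler* sched_;
  std::function<void()> fn_;
};

inline int deferredRunDemo() {
  ToyScheduler sched;
  int counter = 0;
  {
    DeferredRun run(sched, [&] {
      counter += 1;
      sched.enqueue([&] { counter += 10; });  // follow-on work
    });
    assert(counter == 0);  // nothing has executed yet
  }  // scope exit: the action and its follow-on work run here
  return counter;
}
```

Because the holder is movable, it could in principle be stored or handed to another component before being released, which is one way the "chaining" mentioned above might be approached.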

lifflander · 2020-06-02T03:40:16Z

Convert out of draft mode?

PhilMiller · 2020-06-02T05:46:46Z

This is incomplete. It doesn't address the issue stated in #649, though it could probably be retitled and merged sooner. Will check tomorrow.

PhilMiller · 2020-06-02T14:51:19Z

There had been a failure in vt:TestLB.test_lb_1_proc_2 on both Travis builds before I merged with develop. After the merge, it doesn't reproduce in 100 runs. I don't see anything in the commit logs between those two that would have affected this test, though. Am I missing something? Should we just go ahead with the merge?

lifflander · 2020-06-02T23:47:08Z

> There had been a failure in vt:TestLB.test_lb_1_proc_2 on both Travis builds before I merged with develop. After the merge, it doesn't reproduce in 100 runs. I don't see anything in the commit logs between those two that would have affected this test, though. Am I missing something? Should we just go ahead with the merge?

That's surprising to me. I highly doubt the failure is related to this PR. I say we move ahead with this PR. Did you look at the output from the address sanitizer on the Github action run to ensure nothing weird is happening?

lifflander · 2020-06-09T06:40:53Z

So this has had the same spurious failure of test_lb_1_proc_2 as #826.

We need to get to the bottom of this error.

lifflander · 2020-06-10T03:49:19Z

Here is an overview of what got changed by this pull request:

Clones removed
==============
+ examples/collection/lb_iter.cc  -2
+ examples/collection/migrate_collection.cc  -4

See the complete overview on Codacy

PhilMiller requested a review from lifflander May 26, 2020 15:17

PhilMiller commented May 26, 2020

src/vt/vrt/collection/balance/greedylb/greedylb.cc

lifflander reviewed May 26, 2020

src/vt/vrt/collection/balance/greedylb/greedylb.cc (outdated)

PhilMiller force-pushed the 649-action-epoch branch from 0dbd148 to c6c1de8 Compare May 26, 2020 21:24

lifflander reviewed Jun 2, 2020

lifflander approved these changes Jun 2, 2020

PhilMiller changed the title ~~649 action epoch~~ Refactor code away from using low-level termination primitives, to higher-level routines Jun 2, 2020

PhilMiller marked this pull request as ready for review June 2, 2020 14:10

lifflander changed the title ~~Refactor code away from using low-level termination primitives, to higher-level routines~~ 649 Refactor code away from using low-level termination primitives, to higher-level routines Jun 9, 2020

lifflander changed the title ~~649 Refactor code away from using low-level termination primitives, to higher-level routines~~ Refactor code away from using low-level termination primitives, to higher-level routines Jun 9, 2020

Phil Miller added 4 commits June 9, 2020 20:46

Drop redundant push/pop epoch calls

f5253e7

Add method to run scheduler until termination of a specified epoch

ef4dfa2

Switch over flag-only addAction calls to runSchedulerThrough(epoch)

89ad331

Add missing finishedEpoch() call

02bc6ba
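The pattern behind the commits above (replacing an `addAction` that only sets a flag, plus a hand-written scheduler loop, with a single call that runs the scheduler through an epoch) can be sketched with a toy runtime. The names `addAction`, `runScheduler`, and `runSchedulerThrough` are borrowed from this thread, but `ToyRuntime` and its signatures, `send`, and the counting-based notion of termination are assumptions made for illustration; this is not vt's implementation.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using Epoch = int;  // toy epoch identifier

class ToyRuntime {
public:
  // Queue a unit of work that counts toward `ep` until it has run.
  void send(Epoch ep, std::function<void()> work) {
    ++outstanding_[ep];
    queue_.push_back({ep, std::move(work)});
  }

  // Low-level primitive: run `action` when `ep` has no outstanding work.
  // In this toy it must be registered before the epoch finishes.
  void addAction(Epoch ep, std::function<void()> action) {
    actions_[ep].push_back(std::move(action));
  }

  bool isTerminated(Epoch ep) const {
    auto it = outstanding_.find(ep);
    return it == outstanding_.end() || it->second == 0;
  }

  // One scheduler turn: run a single queued work unit, if any.
  void runScheduler() {
    if (queue_.empty()) return;
    auto item = std::move(queue_.front());
    queue_.pop_front();
    item.second();  // may send more work into the same epoch
    if (--outstanding_[item.first] == 0) fireActions(item.first);
  }

  // Higher-level routine: spin the scheduler until `ep` has terminated.
  void runSchedulerThrough(Epoch ep) {
    while (!isTerminated(ep)) runScheduler();
  }

private:
  void fireActions(Epoch ep) {
    auto it = actions_.find(ep);
    if (it == actions_.end()) return;
    auto ready = std::move(it->second);
    actions_.erase(it);
    for (auto& action : ready) action();
  }

  std::deque<std::pair<Epoch, std::function<void()>>> queue_;
  std::map<Epoch, int> outstanding_;
  std::map<Epoch, std::vector<std::function<void()>>> actions_;
};

// Before: a flag-only action and a hand-written scheduler loop.
inline int flagPatternDemo() {
  ToyRuntime rt;
  int sum = 0;
  Epoch const ep = 1;
  rt.send(ep, [&] { sum += 1; rt.send(ep, [&] { sum += 2; }); });
  bool done = false;
  rt.addAction(ep, [&] { done = true; });
  while (!done) rt.runScheduler();
  return sum;
}

// After: the same wait expressed through the higher-level routine.
inline int runThroughDemo() {
  ToyRuntime rt;
  int sum = 0;
  Epoch const ep = 1;
  rt.send(ep, [&] { sum += 1; rt.send(ep, [&] { sum += 2; }); });
  rt.runSchedulerThrough(ep);
  return sum;
}
```

Both functions wait for the same work; the second form has no flag to manage and no action whose registration order matters.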

Phil Miller added 6 commits June 9, 2020 20:46

Allow remote rooted epochs to eventually indicate that they've Terminated

975e588

Block global termination while spinning the scheduler for a particular epoch

1448a85

BaseLB: check that strategies only try to migrate objects during a suitable interval

22bf5db

Add functions to build and run an epoch around some Action

af0ec76

Examples: convert some to not make their own scheduler calls

f5680a1

examples/reduce_integral: Run work in a closed epoch, rather than adding an action to the global epoch

fde87f3

lifflander force-pushed the 649-action-epoch branch from e3781ec to fde87f3 Compare June 10, 2020 03:47

lifflander merged commit 26a049d into develop Jun 10, 2020

cz4rs mentioned this pull request Jun 15, 2021

move vt version number into a separate file #1467

Closed

7 tasks
