Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

875 Implement phase manager component #1123

Merged
merged 36 commits into from
Oct 23, 2020
Merged

Conversation

lifflander
Copy link
Collaborator

@lifflander lifflander commented Oct 20, 2020

Fixes #875

  • Implement the new PhaseManager component
  • Remove a ton of code for the old, complex management of phases
  • Adapt examples and tests to new interface
  • Write documentation on new interface

@codecov
Copy link

codecov bot commented Oct 20, 2020

Codecov Report

Merging #1123 into develop will decrease coverage by 0.11%.
The diff coverage is 88.46%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1123      +/-   ##
===========================================
- Coverage    76.03%   75.92%   -0.12%     
===========================================
  Files          712      713       +1     
  Lines        27101    26940     -161     
===========================================
- Hits         20606    20453     -153     
+ Misses        6495     6487       -8     
Impacted Files Coverage Δ
src/vt/configs/arguments/app_config.h 100.00% <ø> (ø)
src/vt/configs/debug/debug_config.h 85.71% <ø> (ø)
src/vt/rdmahandle/manager.collection.impl.h 0.00% <0.00%> (ø)
src/vt/runtime/runtime.h 100.00% <ø> (ø)
src/vt/utils/memory/memory_usage.h 90.90% <ø> (ø)
src/vt/vrt/collection/balance/elm_stats.h 100.00% <ø> (ø)
...c/vt/vrt/collection/balance/lb_invoke/lb_manager.h 100.00% <ø> (ø)
src/vt/vrt/collection/balance/node_stats.cc 71.02% <ø> (-3.71%) ⬇️
src/vt/vrt/collection/balance/node_stats.h 100.00% <ø> (ø)
src/vt/vrt/collection/holders/base_holder.h 100.00% <ø> (ø)
... and 67 more

@lifflander lifflander marked this pull request as ready for review October 22, 2020 05:57
@lifflander lifflander requested a review from nmm0 October 22, 2020 18:39
@@ -148,6 +149,15 @@ void Trace::startup() /*override*/ {
theSched()->registerTrigger(
sched::SchedulerEvent::EndIdle, [this]{ endIdle(); }
);

thePhase()->registerHookRooted(phase::PhaseHook::End, []{
auto const phase = thePhase()->getCurrentPhase();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the hook interface should pass the phase as an argument? It could be overloaded to support hook that take it, or take nothing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplicity of just an ActionType since it's so easy to call getCurrentPhase

});

thePhase()->registerHookCollective(phase::PhaseHook::EndPostMigration, []{
theTrace()->flushTracesFile(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above vein, I'm doubtful we should actually be flushing every phase.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually flush unless the logic in the trace is activated. See flushTracesFile. It checks the size and some other conditions.

Now, I do agree it might be wasteful to spend a collective epoch on something that might do nothing every phase. But, they are pretty cheap anyway.

}
});

// Second hook, select and then potentially start the LB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we defining the semantics that hooks for a given occasion run in the order they were inserted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see the comments in the PhaseManager. registerHookCollective is collective and guarantees that the hooks are executed in the order in which they are registered.

@@ -1979,6 +1980,21 @@ template <typename ColT, typename... Args>
* messages can be forwarded properly
*/
theLocMan()->getCollectionLM<ColT,IndexType>(proxy);

/**
* Type-erase some lambdas for doing the collective broadcast that collects up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this even need to be a type-erased lambda? Every collection derives indirectly from ElementsStats, and with the simplified logic that just calls addNodeStats, I don't think syncNextPhase still needs to be templated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, there's no universal holder for collection elements. You have to know IndexT to send a message to them even if that message targets something like Migratable. In the future, we can reduce those requirements as we de-template collection elements.

thePhase()->registerHookCollective(phase::PhaseHook::End, []{
auto const cur_phase = thePhase()->getCurrentPhase();
theLBManager()->selectStartLB(cur_phase);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reviewing LBManager, it threw me off a bit to see the function defined there, but not registered there. I understand why that is, but it's a bit disorienting to have it here nonetheless.

If we want to be fancy, we could perhaps give names to the various hooks, define dependencies between them, and rely on the phase manager to run them in a suitable order based on a topological sort, with synchronization between.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had that, would that also obviate the distinction between PhaseHook::End and PhaseHook::EndPostMigration?

Comment on lines +124 to +130
// Start with a reduction to sure all nodes are ready for this
auto cb = theCB()->makeBcast<PhaseManager, NextMsg, &PhaseManager::nextPhaseReduce>(proxy);
auto msg = makeMessage<NextMsg>();
proxy.reduce(msg.get(), cb);

theSched()->runSchedulerWhile([this]{ return not reduce_next_phase_done_; });
reduce_next_phase_done_ = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically just a barrier() on a private collective scope. Do we want to just add that operation, rather than open-coding it here (and possibly duplicated elsewhere)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because the semantic would be different. Barriers generally have global ordering to them and are shared across components. The reduce used here is scoped within the PhaseManager due to reduce scoping rules per component in VT (each component has its own scope by default).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you didn't mean use theCollective()->barrier(), you meant to refactor this as a pattern templated on an objgroup proxy (or maybe any proxy). That seems reasonable for follow-on work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's refactoring the scoped "reduce to a (logical) no-op handler and spin on that" into an objgroup proxy method. Want to open that issue and resolve this comment?

@PhilMiller
Copy link
Member

This all looks really good. Makes the runtime more powerful and flexible, and cuts 400 lines of code. Typos and naming aside, nothing I noted should really block merging this. I do think making ordering more explicit, rather than the implicit "in order of registration", would be of benefit though.

Copy link
Contributor

@cz4rs cz4rs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, nothing more to add than what Phil wrote

Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Copy link
Collaborator Author

Codacy Here is an overview of what got changed by this pull request:

Clones added
============
- examples/collection/lb_iter.cc  1
- tests/unit/collection/test_lb.extended.cc  2
- tests/unit/collection/test_lb_lite.extended.cc  1
         

See the complete overview on Codacy

@lifflander lifflander merged commit 01ccf10 into develop Oct 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement PhaseManager to refactor functionality out of LBManager
4 participants