-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
875 Implement phase manager component #1123
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1123 +/- ##
===========================================
- Coverage 76.03% 75.92% -0.12%
===========================================
Files 712 713 +1
Lines 27101 26940 -161
===========================================
- Hits 20606 20453 -153
+ Misses 6495 6487 -8
|
@@ -148,6 +149,15 @@ void Trace::startup() /*override*/ { | |||
theSched()->registerTrigger( | |||
sched::SchedulerEvent::EndIdle, [this]{ endIdle(); } | |||
); | |||
|
|||
thePhase()->registerHookRooted(phase::PhaseHook::End, []{ | |||
auto const phase = thePhase()->getCurrentPhase(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the hook interface should pass the phase as an argument? It could be overloaded to support hook that take it, or take nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplicity of just an ActionType
since it's so easy to call getCurrentPhase
}); | ||
|
||
thePhase()->registerHookCollective(phase::PhaseHook::EndPostMigration, []{ | ||
theTrace()->flushTracesFile(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the above vein, I'm doubtful we should actually be flushing every phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't actually flush unless the logic in the trace is activated. See flushTracesFile
. It checks the size and some other conditions.
Now, I do agree it might be wasteful to spend a collective epoch on something that might do nothing every phase. But, they are pretty cheap anyway.
} | ||
}); | ||
|
||
// Second hook, select and then potentially start the LB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we defining the semantics that hooks for a given occasion run in the order they were inserted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see the comments in the PhaseManager
. registerHookCollective
is collective and guarantees that the hooks are executed in the order in which they are registered.
@@ -1979,6 +1980,21 @@ template <typename ColT, typename... Args> | |||
* messages can be forwarded properly | |||
*/ | |||
theLocMan()->getCollectionLM<ColT,IndexType>(proxy); | |||
|
|||
/** | |||
* Type-erase some lambdas for doing the collective broadcast that collects up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this even need to be a type-erased lambda? Every collection derives indirectly from ElementsStats
, and with the simplified logic that just calls addNodeStats
, I don't think syncNextPhase
still needs to be templated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, there's no universal holder for collection elements. You have to know IndexT to send a message to them even if that message targets something like Migratable
. In the future, we can reduce those requirements as we de-template collection elements.
thePhase()->registerHookCollective(phase::PhaseHook::End, []{ | ||
auto const cur_phase = thePhase()->getCurrentPhase(); | ||
theLBManager()->selectStartLB(cur_phase); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reviewing LBManager
, it threw me off a bit to see the function defined there, but not registered there. I understand why that is, but it's a bit disorienting to have it here nonetheless.
If we want to be fancy, we could perhaps give names to the various hooks, define dependencies between them, and rely on the phase manager to run them in a suitable order based on a topological sort, with synchronization between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had that, would that also obviate the distinction between PhaseHook::End
and PhaseHook::EndPostMigration
?
// Start with a reduction to sure all nodes are ready for this | ||
auto cb = theCB()->makeBcast<PhaseManager, NextMsg, &PhaseManager::nextPhaseReduce>(proxy); | ||
auto msg = makeMessage<NextMsg>(); | ||
proxy.reduce(msg.get(), cb); | ||
|
||
theSched()->runSchedulerWhile([this]{ return not reduce_next_phase_done_; }); | ||
reduce_next_phase_done_ = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically just a barrier()
on a private collective scope. Do we want to just add that operation, rather than open-coding it here (and possibly duplicated elsewhere)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because the semantic would be different. Barriers generally have global ordering to them and are shared across components. The reduce used here is scoped within the PhaseManager
due to reduce scoping rules per component in VT (each component has its own scope by default).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you didn't mean use theCollective()->barrier()
, you meant to refactor this as a pattern templated on an objgroup proxy (or maybe any proxy). That seems reasonable for follow-on work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it's refactoring the scoped "reduce to a (logical) no-op handler and spin on that" into an objgroup proxy method. Want to open that issue and resolve this comment?
This all looks really good. Makes the runtime more powerful and flexible, and cuts 400 lines of code. Typos and naming aside, nothing I noted should really block merging this. I do think making ordering more explicit, rather than the implicit "in order of registration", would be of benefit though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, nothing more to add than what Phil wrote
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Here is an overview of what got changed by this pull request: Clones added
============
- examples/collection/lb_iter.cc 1
- tests/unit/collection/test_lb.extended.cc 2
- tests/unit/collection/test_lb_lite.extended.cc 1
See the complete overview on Codacy |
Fixes #875
PhaseManager
component