State-less MPI rank management #103
…s could potentially allow us to remove the barrier from MPI_init
@@ -0,0 +1,13 @@
#pragma once
Currently, the implementation of these is so simple that they are alright with just being abstract methods. However, I thought they belonged in `transport` rather than `scheduler`.
const std::string actualHost = world.getHostForRank(1);
REQUIRE(actualHost == expectedHost);
We weren't checking anything here.
26d57ac to 63d7bd0
This is great, the use of state for MPI has always been very fiddly.
A few style/naming/structure comments, but nothing fundamental.
@@ -8,3 +8,5 @@
#define MPI_MESSAGE_PORT 8005
#define SNAPSHOT_PORT 8006
#define REPLY_PORT_OFFSET 100

#define MPI_PORT 8800
How is the `MPI_PORT` different to the existing `MPI_MESSAGE_PORT`, and how come it's so much higher in the port range than the others? It would be cleanest to keep the range of ports we use as narrow as possible (e.g. could this just be `8007`?)
After our offline discussion, I'll leave the port like this as it will change in coming PRs.
src/scheduler/MpiWorld.cpp
Outdated
logger->error("Error emplacing in rankHostMap: {} -> {}",
              i + 1,
              rankHostVec.at(i));
throw std::runtime_error("Error emplacing in rankHostMap");
I'm not sure I understand the use of this pattern with `try_emplace` instead of just `emplace`. What situation is it guarding against? It's adding 5 lines of error handling code, and even when it fails the message isn't particularly clear. Much like avoiding variable names in comments, it's not good to include them in error messages, as this will be even more confusing if the variable name changes.
`try_emplace` might indeed be overkill here. It does the same as `emplace` (and has the same return value), but the element is not constructed if the key is already there (again, not important with `string`s and `int`s). In short, `emplace` and `try_emplace` guard against the same behaviour: trying to insert a key that already exists.
The error message was not clear at all.
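To make the comparison concrete, here is a minimal sketch (C++17) of both calls on a rank-to-host map. The helper names `emplaceRank`, `tryEmplaceRank` and `demoDuplicateRank` are hypothetical and not part of this PR; only `std::map::emplace` and `std::map::try_emplace` are real library calls.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not in the PR): insert with emplace and report
// whether the rank was newly added.
bool emplaceRank(std::map<int, std::string>& rankHostMap,
                 int rank,
                 std::string host)
{
    // emplace returns {iterator, inserted}; an existing entry is left as is
    return rankHostMap.emplace(rank, std::move(host)).second;
}

// Hypothetical helper (not in the PR): same contract, using try_emplace.
bool tryEmplaceRank(std::map<int, std::string>& rankHostMap,
                    int rank,
                    std::string host)
{
    // try_emplace does not construct the value if the key already exists
    return rankHostMap.try_emplace(rank, std::move(host)).second;
}

// Both calls refuse to overwrite a rank that is already in the map.
void demoDuplicateRank()
{
    std::map<int, std::string> rankHostMap;
    assert(emplaceRank(rankHostMap, 1, "hostA"));
    assert(!emplaceRank(rankHostMap, 1, "hostB"));
    assert(!tryEmplaceRank(rankHostMap, 1, "hostB"));
    assert(rankHostMap.at(1) == "hostA");
}
```

In both cases the second insertion for rank 1 is rejected and the original host is kept, so for `int` keys and `string` values the choice between the two is mostly about readability.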
Yes, I understand what `try_emplace` does, but what situation is it guarding against here? It seems like this will only error if we're setting a value for a rank that has already been set. AFAICT this won't happen as it's only called once in MPI initialisation, right? Even if it did somehow happen (e.g. if someone changes the code in future), it's probably not worthy of an error, at most a warning to say that it's getting reset. Either way I think just using `emplace` is fine and much clearer.
@@ -1051,16 +1058,13 @@ template void doReduceTest<double>(scheduler::MpiWorld& world,

TEST_CASE("Test reduce", "[mpi]")
{
    cleanFaabric();
Would be nice to add a fixture to this file as we also have a custom `tearDown` method. One for another PR probably.
Nice one 👍
In this PR I remove the dependency on `state` for the rank-host map bookkeeping. This was on my to-do list, as it really wasn't needed. Even though it has no direct impact on performance, it opens the door to rank-to-rank communication. I elaborate.
Before, upon `MPI_Init`, all ranks would halt at a barrier. Non-zero ranks would `recv` a broadcast message from the root rank (0) and bind to their local queue (not knowing which host the message came from). Only when trying to send a response afterwards would they query `redis` for the host where the root rank was placed. If we were to query `redis` before the first `recv` (to, for instance, `recv` on a per-rank port), we would be racing the root rank and its update of the rank-host map.
Given the socket-like tools introduced with zeromq, we can just send the map around to the hosts that need it. This is greatly facilitated by the fact that the scheduler returns the list of hosts where it has scheduled the calls.
The only part of the API that now needs the state is the RMA calls, which I haven't worked with at all, but I'm quite keen to do so when they are needed.
RIP `registerRank` 💀