[RFC] Initial CPU MPI implementation #833
base: master
Conversation
Windows builds genuinely fail and need to be fixed: some warnings from the new code show up and we treat warnings as errors.
@snukky thank you. I was more asking whether we are at all interested in having this, and whether it is a problem that for now it only works with one thread per process and only in "global" gradient sharding mode. Also, we don't have any MPI tests, do we?
I guess having it doesn't hurt. Did you have a use case?
No, I think we don't, but it would be great to have at least a unit test.
We're expecting some hardware that isn't made by NVIDIA.
@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.
@emjotde yeah, I looked at it; that's what I expect to be more efficient, but due to differences between MPI and NCCL, implementing it is a bit more complicated and I've put it on the queue. Cheers, Nick
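For reference, a minimal sketch (illustrative only, not code from this PR; reduceScatterGrads and its arguments are made up for the example) of what a plain-MPI reduce-scatter over a gradient buffer looks like. Unlike ncclReduceScatter, which takes a single uniform recvcount, MPI_Reduce_scatter takes a per-rank recvcounts array and has no built-in fp16 datatype, which is part of why the NCCL path does not carry over one-to-one:

#include <mpi.h>
#include <vector>

// Assumes grads has already been padded so that its size is numRanks * shard,
// matching the "all shards have the same size" convention used by shardSize().
void reduceScatterGrads(std::vector<float>& grads, int numRanks) {
  int shard = static_cast<int>(grads.size()) / numRanks;
  std::vector<int> recvcounts(numRanks, shard); // per-rank element counts
  std::vector<float> myShard(shard);            // this rank's reduced slice
  MPI_Reduce_scatter(grads.data(), myShard.data(), recvcounts.data(),
                     MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}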
Generally OK. Issues are mostly about comments. I am assuming this to be early code, so not insisting on combining with NCCLCommunicator (yet).
@@ -25,6 +26,9 @@ Ptr<ICommunicator> createCommunicator(
const std::vector<Ptr<ExpressionGraph>>& graphs,
bool noNccl, ShardingMode shardingMode, Ptr<IMPIWrapper> mpi) {
mpi;
Looks like the mpi; line is redundant now.
sendbuf; recvbuf; count; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
virtual void Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm = MPI_COMM_WORLD) const {
Casing: allGather?
virtual void reduceScatter(const void * sendbuf, void * recvbuf, int * recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm = MPI_COMM_WORLD) const {
sendbuf; recvbuf; recvcounts; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
Since those are non-pure virtual functions with implementations, let's add empty lines around them for readability.
Similar below.
size_t shardSize() const {
size_t numShards = shardingMode_ == ShardingMode::global ? numRanks() : numLocalRanks();
size_t size = (dataSize() + numShards - 1) / numShards;
#if 1 // for now, all shards must have the same size, since NCCL does not allow a sub-slice for the last shard
Adapt comment to reflect that this was copied from the NCCL communicator.
Similar above and any other mention of NCCL in this file.
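As a concrete reading of the rounding in shardSize() above: with dataSize() = 10 and 4 shards, shardSize() = (10 + 4 - 1) / 4 = 3, so the buffer is treated as 4 equal shards of 3 elements each, with the last shard covering the padded remainder rather than a shorter sub-slice.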
MPI_Datatype mpiFLoatType = MPI_FLOAT;
if(grads->type() == Type::float16)
ABORT("Half precision is datatype is not supported by MPI.");
Grammar? One "is" too many?
@@ -0,0 +1,324 @@
// Note: This must only be included if defined(CUDA_FOUND) && defined(USE_NCCL)
Outdated?
}
};

} // namespace marian
Generally a ton of code duplication with the NCCL communicator. Once you think about adding local sharding, it might be worth combining them? That will make it easier to maintain and ensure this does not go down the path of the other MPI implementations that ended up unused and were removed in the end.
Unless you would like to start combining it with the NCCL communicator already?
The MPI communicator offers a subset of the functionality that NCCL does. When (eventually) we want to allow multiple threads per process, the implementations of the two big collective-operation functions will start differing. What do you propose? I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific stuff that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI, which supports more datatypes than MPI (but only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users. I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained, and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)
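To make the "combine" option concrete, a hypothetical sketch (not part of this PR; ShardedCommunicatorBase and its members are invented names, not Marian's actual API) of a small base class that owns the sharding arithmetic shared by the NCCL and MPI communicators, leaving only the collective calls backend-specific:

#include <cstddef>

class ShardedCommunicatorBase {
protected:
  size_t dataSize_;   // total number of gradient/parameter elements
  size_t numShards_;  // ranks (global mode) or local ranks (local mode)

public:
  ShardedCommunicatorBase(size_t dataSize, size_t numShards)
      : dataSize_(dataSize), numShards_(numShards) {}
  virtual ~ShardedCommunicatorBase() {}

  // Ceiling division shared by both backends; all shards get the same size.
  size_t shardSize() const { return (dataSize_ + numShards_ - 1) / numShards_; }

  // Backend-specific collectives, implemented with NCCL or MPI.
  virtual void scatterReduceGrads() = 0;
  virtual void allGatherParams() = 0;
};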
Description
This adds initial CPU MPI support, fixing #744. The limitation is that you need to have only one cpu-thread per process. It also only supports "global" sharding mode (for now).
Are you upstream interested in a CPU MPI implementation? Comments on my implementation are very welcome.
How to test
mpirun -n X /path/to/marian -c config.yml
as long as cpu-threads is set to 1 (i.e. --cpu-threads 1 on the command line or cpu-threads: 1 in the config).
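For example (hypothetical values; the process count, binary path, and config name are placeholders):
mpirun -n 4 /path/to/marian -c config.yml --cpu-threads 1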
Checklist