[RFC] Initial CPU MPI implementation #833

Open
wants to merge 4 commits into base: master

Conversation

@XapaJIaMnu (Contributor) commented Mar 17, 2021

Description

This adds initial CPU MPI support, fixing #744. The limitation is that you must use exactly one cpu-thread per process. It also only supports the "global" sharding mode (for now).

Are you guys upstream interested in a CPU MPI implementation? Comments on my implementation are very welcome.

How to test

mpirun -n X /path/to/marian -c config.yml, as long as the configuration sets --cpu-threads to 1.
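For readers unfamiliar with the setup: with a single CPU thread per MPI process, the cross-rank exchange amounts to a reduce-scatter of gradients followed by an all-gather, which are exactly the two collectives the wrapper additions reviewed below expose. A minimal standalone sketch with the plain MPI C API and toy sizes; illustration only, not Marian code:

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nRanks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  const int shardSize = 4;                             // toy shard size
  std::vector<float> grads(shardSize * nRanks, 1.0f);  // local gradients
  std::vector<float> myShard(shardSize, 0.0f);         // this rank's summed shard

  // Reduce-scatter: sum gradients element-wise over ranks; rank r keeps shard r.
  MPI_Reduce_scatter_block(grads.data(), myShard.data(), shardSize,
                           MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  // (A sharded optimizer update on myShard would happen here.)

  // All-gather: re-assemble the full updated vector on every rank.
  MPI_Allgather(myShard.data(), shardSize, MPI_FLOAT,
                grads.data(), shardSize, MPI_FLOAT, MPI_COMM_WORLD);

  if(rank == 0)
    std::printf("grads[0] = %.1f across %d ranks\n", grads[0], nRanks);

  MPI_Finalize();
  return 0;
}
```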

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@snukky (Member) commented Mar 19, 2021

Windows builds legitimately fail and need to be fixed: the new code produces some warnings, and we treat warnings as errors.

@snukky requested a review from emjotde on March 19, 2021 at 16:02
@XapaJIaMnu (Contributor Author):

@snukky thank you. I was mostly asking whether there is any interest in having this at all, and whether it is a problem that, for now, it only works with one thread per process and only in "global" gradient sharding mode.

Also, we don't have any MPI tests, do we?

@emjotde (Member) commented Mar 20, 2021

I guess having it doesn't hurt. Did you have a use case?

@snukky (Member) commented Mar 21, 2021

Also, we don't have any MPI tests, do we?

No, I think we don't, but it would be great to have at least a unit test.

@XapaJIaMnu (Contributor Author):

I guess having it doesn't hurt. Did you have a use case?

We're expecting some hardware that isn't made by NVIDIA.

@emjotde (Member) commented Mar 21, 2021

@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.

@XapaJIaMnu (Contributor Author) commented Mar 22, 2021

@emjotde yeah, I looked at it; that's what I expect to be more efficient, but due to differences between MPI and NCCL, implementing it is a bit more complicated, so I've put it in the queue.

Cheers,

Nick

@emjotde self-assigned this Mar 23, 2021

@emjotde (Member) left a comment

Generally OK. The issues are mostly about comments. I am assuming this is early code, so I am not insisting on combining it with NCCLCommunicator (yet).

@@ -25,6 +26,9 @@ Ptr<ICommunicator> createCommunicator(
const std::vector<Ptr<ExpressionGraph>>& graphs,
bool noNccl, ShardingMode shardingMode, Ptr<IMPIWrapper> mpi) {
mpi;
Member:

Looks like the `mpi;` line is redundant now.

sendbuf; recvbuf; count; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
virtual void Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm = MPI_COMM_WORLD) const {
Member:

Casing: allGather?

virtual void reduceScatter(const void * sendbuf, void * recvbuf, int * recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm = MPI_COMM_WORLD) const {
sendbuf; recvbuf; recvcounts; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
Member:

Since those are non-pure virtual functions with implementations, let's add empty lines around them for readability.

Member:

Similar below.
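To make the two style points (allGather casing, blank lines around the non-pure virtual stubs) concrete, here is a generic, self-contained illustration of the fake-wrapper pattern; the types and messages are stand-ins, not the PR's actual wrapper code:

```cpp
#include <cstdlib>
#include <iostream>

// Stand-in for the fake (no-MPI) wrapper: non-pure virtual stubs that abort
// if they are called without the real backend compiled in.
struct FakeCommWrapper {
  virtual ~FakeCommWrapper() = default;

  virtual void reduceScatter(const void* sendbuf, void* recvbuf, int count) const {
    (void)sendbuf; (void)recvbuf; (void)count;  // unused in the fake wrapper
    std::cerr << "reduceScatter requires compiling with -DUSE_MPI=ON\n";
    std::abort();
  }

  virtual void allGather(const void* sendbuf, int sendcount, void* recvbuf) const {
    (void)sendbuf; (void)sendcount; (void)recvbuf;  // unused in the fake wrapper
    std::cerr << "allGather requires compiling with -DUSE_MPI=ON\n";
    std::abort();
  }
};

int main() {
  FakeCommWrapper fake;
  fake.allGather(nullptr, 0, nullptr);  // deliberately aborts with a clear message
  return 0;
}
```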

size_t shardSize() const {
size_t numShards = shardingMode_ == ShardingMode::global ? numRanks() : numLocalRanks();
size_t size = (dataSize() + numShards - 1) / numShards;
#if 1 // for now, all shards must have the same size, since NCCL does not allow a sub-slice for the last shard
Member:

Adapt comment to reflect that this was copied from the NCCL communicator.

Member:

Similar above and any other mention of NCCL in this file.
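A quick numeric check of the ceil-division in the quoted shardSize(), with toy numbers (standalone snippet, not Marian code):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Toy numbers: 10 elements split over 4 shards.
  std::size_t dataSize = 10, numShards = 4;
  std::size_t shardSize = (dataSize + numShards - 1) / numShards;  // ceil(10/4) = 3
  std::printf("shardSize = %zu\n", shardSize);
  // All shards are nominally 3 elements; the last one holds only 10 - 3*3 = 1
  // real element and is padded, per the constraint noted in the quoted comment.
  return 0;
}
```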


MPI_Datatype mpiFLoatType = MPI_FLOAT;
if(grads->type() == Type::float16)
ABORT("Half precision is datatype is not supported by MPI.");
Member:

Grammar? One "is" too many?
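Presumably the intended wording is "Half-precision datatype is not supported by MPI." (and mpiFLoatType is presumably meant to be mpiFloatType). As a small standalone aside on the datatype itself, not the PR's code:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int bytes = 0;
  MPI_Type_size(MPI_FLOAT, &bytes);               // MPI_FLOAT is the 32-bit float datatype
  std::printf("MPI_FLOAT is %d bytes\n", bytes);  // expect 4; no predefined half type, hence the ABORT above
  MPI_Finalize();
  return 0;
}
```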

@@ -0,0 +1,324 @@
// Note: This must only be included if defined(CUDA_FOUND) && defined(USE_NCCL)
Member:

Outdated?

}
};

} // namespace marian
Member:

Generally a ton of code duplication with the NCCL communicator. Once you think about adding local sharding, it might be worth combining them? That will make it easier to maintain and ensure it does not go down the path of the other MPI implementations that ended up unused and were removed in the end.

@emjotde (Member) commented Mar 23, 2021

Unless you would like to start combining it with the NCCL communicator already?

@XapaJIaMnu (Contributor Author):

The MPI communicator offers a subset of the functionality that NCCL does. When we (eventually) want to allow multiple threads per process, the implementations of the two big collective operation functions will start to differ.

What do you propose?

I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific code that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI that supports more datatypes than plain MPI (but it only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users.

I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)
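For what it's worth, a possible middle ground between fully separate communicators and inheriting from the NCCL one could be a thin shared base for the backend-independent parts (shard layout), with only the collectives overridden per backend. A hypothetical, self-contained sketch; the class names are invented and this is not tied to Marian's actual class layout:

```cpp
#include <cstddef>
#include <vector>

// Shared, backend-independent part: owns the shard-layout arithmetic.
class ShardedCommunicatorBase {
public:
  ShardedCommunicatorBase(std::size_t dataSize, std::size_t numShards)
      : dataSize_(dataSize), numShards_(numShards) {}
  virtual ~ShardedCommunicatorBase() = default;

  // Same padded shard size regardless of backend.
  std::size_t shardSize() const {
    return (dataSize_ + numShards_ - 1) / numShards_;
  }

  // Only the collectives differ between NCCL and MPI backends.
  virtual void allReduceSum(std::vector<float>& buffer) const = 0;

protected:
  std::size_t dataSize_;
  std::size_t numShards_;
};

// CPU/MPI backend: would call MPI collectives; stubbed here so the sketch
// compiles and runs on its own.
class MpiCpuCommunicator : public ShardedCommunicatorBase {
public:
  using ShardedCommunicatorBase::ShardedCommunicatorBase;
  void allReduceSum(std::vector<float>& /*buffer*/) const override {
    // e.g. MPI_Allreduce(MPI_IN_PLACE, buffer.data(), (int)buffer.size(),
    //                    MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  }
};

int main() {
  MpiCpuCommunicator comm(/*dataSize=*/1000, /*numShards=*/4);
  std::vector<float> grads(1000, 1.0f);
  comm.allReduceSum(grads);  // no-op in this sketch
  return 0;
}
```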
