[RFC] Initial CPU MPI implementation #833

Open
wants to merge 4 commits into base: master

Conversation

@XapaJIaMnu (Contributor) commented Mar 17, 2021

Description

This adds initial CPU MPI support, fixing #744. The limitation is that you must use exactly one cpu-thread per process. It also only supports the "global" sharding mode (for now).

Are you guys upstream interested in a CPU MPI implementation? Comments on my implementation are very welcome.

How to test

mpirun -n X /path/to/marian -c config.yml, as long as the configuration sets --cpu-threads to 1.
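For readers unfamiliar with the setup: with a single CPU thread per MPI process, the cross-rank exchange amounts to a reduce-scatter of gradients followed by an all-gather, which are exactly the two collectives the wrapper additions reviewed below expose. A minimal standalone sketch with the plain MPI C API and toy sizes; illustration only, not Marian code:

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nRanks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  const int shardSize = 4;                             // toy shard size
  std::vector<float> grads(shardSize * nRanks, 1.0f);  // local gradients
  std::vector<float> myShard(shardSize, 0.0f);         // this rank's summed shard

  // Reduce-scatter: sum gradients element-wise over ranks; rank r keeps shard r.
  MPI_Reduce_scatter_block(grads.data(), myShard.data(), shardSize,
                           MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  // (A sharded optimizer update on myShard would happen here.)

  // All-gather: re-assemble the full updated vector on every rank.
  MPI_Allgather(myShard.data(), shardSize, MPI_FLOAT,
                grads.data(), shardSize, MPI_FLOAT, MPI_COMM_WORLD);

  if(rank == 0)
    std::printf("grads[0] = %.1f across %d ranks\n", grads[0], nRanks);

  MPI_Finalize();
  return 0;
}
```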

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@snukky (Member) commented Mar 19, 2021

Windows builds legitimately fail and need to be fixed: the new code produces some warnings, and we treat warnings as errors.

@snukky requested a review from emjotde on March 19, 2021 at 16:02
@XapaJIaMnu (Contributor Author):

@snukky thank you. I was mostly asking whether there is any interest in having this at all, and whether it is a problem that, for now, it only works with one thread per process and only in "global" gradient sharding mode.

Also, we don't have any MPI tests, do we?

@emjotde (Member) commented Mar 20, 2021

I guess having it doesn't hurt. Did you have a use case?

@snukky (Member) commented Mar 21, 2021

Also, we don't have any MPI tests, do we?

No, I think we don't, but it would be great to have at least a unit test.

@XapaJIaMnu (Contributor Author):

I guess having it doesn't hurt. Did you have a use case?

We're expecting some hardware that isn't made by NVIDIA.

@emjotde (Member) commented Mar 21, 2021

@XapaJIaMnu Oh interesting. BTW, local sharding for a setup like 8 GPUs and 4 nodes is about 25% faster.

@XapaJIaMnu (Contributor Author) commented Mar 22, 2021

@emjotde yeah, I looked at it; that's what I expect to be more efficient, but due to differences between MPI and NCCL, implementing it is a bit more complicated, so I've put it in the queue.

Cheers,

Nick

@emjotde self-assigned this Mar 23, 2021

@emjotde (Member) left a comment

Generally OK. The issues are mostly about comments. I am assuming this is early code, so I am not insisting on combining it with NCCLCommunicator (yet).

@@ -25,6 +26,9 @@ Ptr<ICommunicator> createCommunicator(
const std::vector<Ptr<ExpressionGraph>>& graphs,
bool noNccl, ShardingMode shardingMode, Ptr<IMPIWrapper> mpi) {
mpi;
Member:

Looks like the `mpi;` line is redundant now.

sendbuf; recvbuf; count; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
virtual void Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm = MPI_COMM_WORLD) const {
Member:

Casing: allGather?

virtual void reduceScatter(const void * sendbuf, void * recvbuf, int * recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm = MPI_COMM_WORLD) const {
sendbuf; recvbuf; recvcounts; datatype; op; comm; // unused in the fakeMPI wrapper
ABORT("ReduceScatter is only implemented when compiled with -DUSE_MPI=ON");
}
Member:

Since those are non-pure virtual functions with implementations, let's add empty lines around them for readability.

Member:

Similar below.
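To make the two style points (allGather casing, blank lines around the non-pure virtual stubs) concrete, here is a generic, self-contained illustration of the fake-wrapper pattern; the types and messages are stand-ins, not the PR's actual wrapper code:

```cpp
#include <cstdlib>
#include <iostream>

// Stand-in for the fake (no-MPI) wrapper: non-pure virtual stubs that abort
// if they are called without the real backend compiled in.
struct FakeCommWrapper {
  virtual ~FakeCommWrapper() = default;

  virtual void reduceScatter(const void* sendbuf, void* recvbuf, int count) const {
    (void)sendbuf; (void)recvbuf; (void)count;  // unused in the fake wrapper
    std::cerr << "reduceScatter requires compiling with -DUSE_MPI=ON\n";
    std::abort();
  }

  virtual void allGather(const void* sendbuf, int sendcount, void* recvbuf) const {
    (void)sendbuf; (void)sendcount; (void)recvbuf;  // unused in the fake wrapper
    std::cerr << "allGather requires compiling with -DUSE_MPI=ON\n";
    std::abort();
  }
};

int main() {
  FakeCommWrapper fake;
  fake.allGather(nullptr, 0, nullptr);  // deliberately aborts with a clear message
  return 0;
}
```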

size_t shardSize() const {
size_t numShards = shardingMode_ == ShardingMode::global ? numRanks() : numLocalRanks();
size_t size = (dataSize() + numShards - 1) / numShards;
#if 1 // for now, all shards must have the same size, since NCCL does not allow a sub-slice for the last shard
Member:

Adapt comment to reflect that this was copied from the NCCL communicator.

Member:

Similar above and any other mention of NCCL in this file.
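A quick numeric check of the ceil-division in the quoted shardSize(), with toy numbers (standalone snippet, not Marian code):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Toy numbers: 10 elements split over 4 shards.
  std::size_t dataSize = 10, numShards = 4;
  std::size_t shardSize = (dataSize + numShards - 1) / numShards;  // ceil(10/4) = 3
  std::printf("shardSize = %zu\n", shardSize);
  // All shards are nominally 3 elements; the last one holds only 10 - 3*3 = 1
  // real element and is padded, per the constraint noted in the quoted comment.
  return 0;
}
```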


MPI_Datatype mpiFLoatType = MPI_FLOAT;
if(grads->type() == Type::float16)
ABORT("Half precision is datatype is not supported by MPI.");
Member:

Grammar? One "is" too many?
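Presumably the intended wording is "Half-precision datatype is not supported by MPI." (and mpiFLoatType is presumably meant to be mpiFloatType). As a small standalone aside on the datatype itself, not the PR's code:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int bytes = 0;
  MPI_Type_size(MPI_FLOAT, &bytes);               // MPI_FLOAT is the 32-bit float datatype
  std::printf("MPI_FLOAT is %d bytes\n", bytes);  // expect 4; no predefined half type, hence the ABORT above
  MPI_Finalize();
  return 0;
}
```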

@@ -0,0 +1,324 @@
// Note: This must only be included if defined(CUDA_FOUND) && defined(USE_NCCL)
Member:

Outdated?

}
};

} // namespace marian
Member:

Generally a ton of code duplication with the NCCL communicator. Once you think about adding local sharding, it might be worth combining them? That will make it easier to maintain and ensure it does not go down the path of the other MPI implementations that ended up unused and were removed in the end.

@emjotde (Member) commented Mar 23, 2021

Unless you would like to start combining it with the NCCL communicator already?

@XapaJIaMnu (Contributor Author):

The MPI communicator offers a subset of the functionality that NCCL does. When we (eventually) want to allow multiple threads per process, the implementations of the two big collective operation functions will start to differ.

What do you propose?

I thought it made more sense to keep separate communicators, but I admit there is more code duplication than I would like. I can try to inherit from the NCCL communicator, but it has a lot of CUDA-specific code that would need to be hidden away. Furthermore, Intel provides oneCCL, an abstraction over MPI that supports more datatypes than plain MPI (but it only works with Intel MPI). I guess we're not interested in having more than one communication backend, as there would be very few interested users.

I'm OK with delaying the merge until I have a more complete implementation, as the code is fairly self-contained and I find it unlikely that any changes to master would break it. (Unless you are planning some extra MPI work?)
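For what it's worth, a possible middle ground between fully separate communicators and inheriting from the NCCL one could be a thin shared base for the backend-independent parts (shard layout), with only the collectives overridden per backend. A hypothetical, self-contained sketch; the class names are invented and this is not tied to Marian's actual class layout:

```cpp
#include <cstddef>
#include <vector>

// Shared, backend-independent part: owns the shard-layout arithmetic.
class ShardedCommunicatorBase {
public:
  ShardedCommunicatorBase(std::size_t dataSize, std::size_t numShards)
      : dataSize_(dataSize), numShards_(numShards) {}
  virtual ~ShardedCommunicatorBase() = default;

  // Same padded shard size regardless of backend.
  std::size_t shardSize() const {
    return (dataSize_ + numShards_ - 1) / numShards_;
  }

  // Only the collectives differ between NCCL and MPI backends.
  virtual void allReduceSum(std::vector<float>& buffer) const = 0;

protected:
  std::size_t dataSize_;
  std::size_t numShards_;
};

// CPU/MPI backend: would call MPI collectives; stubbed here so the sketch
// compiles and runs on its own.
class MpiCpuCommunicator : public ShardedCommunicatorBase {
public:
  using ShardedCommunicatorBase::ShardedCommunicatorBase;
  void allReduceSum(std::vector<float>& /*buffer*/) const override {
    // e.g. MPI_Allreduce(MPI_IN_PLACE, buffer.data(), (int)buffer.size(),
    //                    MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  }
};

int main() {
  MpiCpuCommunicator comm(/*dataSize=*/1000, /*numShards=*/4);
  std::vector<float> grads(1000, 1.0f);
  comm.allReduceSum(grads);  // no-op in this sketch
  return 0;
}
```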
