
Enhance performance when (un)packing derived datatypes #6677

Closed
derbeyn opened this issue May 17, 2019 · 1 comment
derbeyn commented May 17, 2019

After changing their application to use MPI derived datatypes, one of our major customers noticed a significant performance degradation during data exchanges.
One of our students did his internship on this topic a year ago.
He found that part of the degradation came from indexed DDT packing/unpacking.

Using an Intel compiler feature, he could substantially speed up some indexed DDT exchanges: passing a fixed size to memcpy() lets the Intel compiler inline the call as an assignment, provided the size matches the size of an int (4 bytes) or of a double (8 bytes). With this change, when an indexed DDT consists of a large number of MPI_DOUBLEs, for example, a series of assignments is performed instead of a series of memcpy() calls, which yields a large performance improvement.
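For illustration only, here is a minimal sketch of that fixed-size memcpy() dispatch. This is not the actual patch; the pack_indexed() helper and its signature are made up for this example, and only the dispatch-on-constant-size idea reflects what is described above.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical helper: gather 'count' elements of 'elem_size' bytes each
 * from non-contiguous byte offsets into a contiguous pack buffer, as an
 * indexed-datatype pack loop would. When the size passed to memcpy() is a
 * compile-time constant (4 or 8), compilers such as icc can inline the
 * call as a single load/store instead of a library call. */
static void pack_indexed(void *dst, const void *src_base,
                         const ptrdiff_t *offsets, size_t count,
                         size_t elem_size)
{
    char *d = (char *) dst;

    switch (elem_size) {
    case 4:  /* e.g. MPI_INT, MPI_FLOAT */
        for (size_t i = 0; i < count; i++, d += 4)
            memcpy(d, (const char *) src_base + offsets[i], 4);  /* inlined as a 32-bit move */
        break;
    case 8:  /* e.g. MPI_DOUBLE */
        for (size_t i = 0; i < count; i++, d += 8)
            memcpy(d, (const char *) src_base + offsets[i], 8);  /* inlined as a 64-bit move */
        break;
    default: /* generic fallback: variable-size copy, no inlining guarantee */
        for (size_t i = 0; i < count; i++, d += elem_size)
            memcpy(d, (const char *) src_base + offsets[i], elem_size);
        break;
    }
}
```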

The proposed code is a bit hacky, but with it we can cut the latency of DDT exchanges by up to a factor of 2.

The patch is proposed on master, as usual, but it has only been tested on v3.x.

The following graph shows the latency improvement we could get for several kinds of transfers:

[graph: latency comparison for the transfer cases 0-7 listed below, including the "ompi new_assign" series]

0: latency exchanging column 0 of a 4K x 4K matrix of MPI_DOUBLEs
1: latency exchanging column 2047 of a 4K x 4K matrix of MPI_DOUBLEs
2: latency exchanging column 4095 of a 4K x 4K matrix of MPI_DOUBLEs
3: latency exchanging diagonal 0 of a 4K x 4K matrix of MPI_DOUBLEs
4: latency exchanging diagonal 2047 of a 4K x 4K matrix of MPI_DOUBLEs
5: latency exchanging diagonal 4095 of a 4K x 4K matrix of MPI_DOUBLEs
6: latency exchanging diagonal -2047 of a 4K x 4K matrix of MPI_DOUBLEs
7: latency exchanging diagonal -4095 of a 4K x 4K matrix of MPI_DOUBLEs

diag 0: from the top-left corner to the bottom-right corner (matrix[4095][4095])

> 0 diag number = shifted towards the top-right
< 0 diag number = shifted towards the bottom-left

A sketch of how such column and diagonal datatypes can be described is shown below.
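The sketch below assumes a row-major 4096 x 4096 matrix of doubles; the function and variable names (build_types, column_type, diag_type) are illustrative and not taken from the benchmark code, but the construction shows the kind of derived datatypes these transfers rely on.

```c
#include <mpi.h>

#define N 4096   /* matrix dimension assumed from the cases above */

/* Build a datatype for one matrix column and one for diagonal 'diag'
 * (diag >= 0: shifted towards the top-right, diag < 0: towards the
 * bottom-left), matching the numbering used in the list above. */
void build_types(MPI_Datatype *column_type, MPI_Datatype *diag_type, int diag)
{
    /* One double per row with a stride of one full row describes a column;
     * sending 1 element of this type from &matrix[0][c] selects column c. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, column_type);
    MPI_Type_commit(column_type);

    /* The diagonal is described as length-1 blocks at explicit element
     * displacements from the start of the matrix. */
    int len = N - (diag >= 0 ? diag : -diag);
    int displs[N];
    for (int i = 0; i < len; i++) {
        int row = (diag >= 0) ? i : i - diag;   /* -diag > 0 shifts rows down */
        int col = (diag >= 0) ? i + diag : i;
        displs[i] = row * N + col;
    }
    MPI_Type_create_indexed_block(len, 1, displs, MPI_DOUBLE, diag_type);
    MPI_Type_commit(diag_type);
}
```

Packing either type walks non-contiguous 8-byte elements, which is exactly the case the fixed-size memcpy() dispatch above targets.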

In this graph, you will also notice a third series of measurements (ompi new_assign): it corresponds to @ggouaillardet's proposal in #6617 to change some memcpy() operations into assignments.

bosilca commented Jun 21, 2019

The underlying issue is now addressed in #6695.

bosilca closed this as completed Jun 21, 2019