
Enhance performance when (un)packing derived datatypes #6677

Closed
derbeyn opened this issue May 17, 2019 · 1 comment
derbeyn commented May 17, 2019

After changing their application to use MPI derived datatypes, one of our major customers noticed a significant performance degradation during data exchanges.
One of our students did his internship on this topic a year ago.
He found that part of the degradation came from indexed DDT packing/unpacking.

Using an Intel compiler feature, he could substantially speed up some indexed DDT exchanges: passing a fixed size to memcpy() lets the Intel compiler inline the call as an assignment, provided the size matches the size of an int (4 bytes) or of a double (8 bytes). With this change, when an indexed DDT consists of a large number of MPI_DOUBLEs, for example, a series of assignments is performed instead of a series of memcpy() calls, which yields a large performance improvement.
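For illustration only, here is a minimal sketch of that fixed-size memcpy() dispatch. This is not the actual patch; the pack_indexed() helper and its signature are made up for this example, and only the dispatch-on-constant-size idea reflects what is described above.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical helper: gather 'count' elements of 'elem_size' bytes each
 * from non-contiguous byte offsets into a contiguous pack buffer, as an
 * indexed-datatype pack loop would. When the size passed to memcpy() is a
 * compile-time constant (4 or 8), compilers such as icc can inline the
 * call as a single load/store instead of a library call. */
static void pack_indexed(void *dst, const void *src_base,
                         const ptrdiff_t *offsets, size_t count,
                         size_t elem_size)
{
    char *d = (char *) dst;

    switch (elem_size) {
    case 4:  /* e.g. MPI_INT, MPI_FLOAT */
        for (size_t i = 0; i < count; i++, d += 4)
            memcpy(d, (const char *) src_base + offsets[i], 4);  /* inlined as a 32-bit move */
        break;
    case 8:  /* e.g. MPI_DOUBLE */
        for (size_t i = 0; i < count; i++, d += 8)
            memcpy(d, (const char *) src_base + offsets[i], 8);  /* inlined as a 64-bit move */
        break;
    default: /* generic fallback: variable-size copy, no inlining guarantee */
        for (size_t i = 0; i < count; i++, d += elem_size)
            memcpy(d, (const char *) src_base + offsets[i], elem_size);
        break;
    }
}
```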

The proposed code is a bit hacky, but with it we can cut the latency of DDT exchanges by up to a factor of 2.

The patch is proposed on master, as usual, but it has only been tested on v3.x.

The following graph shows the latency improvement we could get for several kinds of transfers:

[graph: latency comparison for the transfer cases 0-7 listed below, including the "ompi new_assign" series]

0: latency exchanging column 0 of a 4K x 4K matrix of MPI_DOUBLEs
1: latency exchanging column 2047 of a 4K x 4K matrix of MPI_DOUBLEs
2: latency exchanging column 4095 of a 4K x 4K matrix of MPI_DOUBLEs
3: latency exchanging diagonal 0 of a 4K x 4K matrix of MPI_DOUBLEs
4: latency exchanging diagonal 2047 of a 4K x 4K matrix of MPI_DOUBLEs
5: latency exchanging diagonal 4095 of a 4K x 4K matrix of MPI_DOUBLEs
6: latency exchanging diagonal -2047 of a 4K x 4K matrix of MPI_DOUBLEs
7: latency exchanging diagonal -4095 of a 4K x 4K matrix of MPI_DOUBLEs

diag 0: from the top-left corner to the bottom-right corner (matrix[4095][4095])

> 0 diag number = shifted towards the top-right
< 0 diag number = shifted towards the bottom-left

A sketch of how such column and diagonal datatypes can be described is shown below.
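The sketch below assumes a row-major 4096 x 4096 matrix of doubles; the function and variable names (build_types, column_type, diag_type) are illustrative and not taken from the benchmark code, but the construction shows the kind of derived datatypes these transfers rely on.

```c
#include <mpi.h>

#define N 4096   /* matrix dimension assumed from the cases above */

/* Build a datatype for one matrix column and one for diagonal 'diag'
 * (diag >= 0: shifted towards the top-right, diag < 0: towards the
 * bottom-left), matching the numbering used in the list above. */
void build_types(MPI_Datatype *column_type, MPI_Datatype *diag_type, int diag)
{
    /* One double per row with a stride of one full row describes a column;
     * sending 1 element of this type from &matrix[0][c] selects column c. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, column_type);
    MPI_Type_commit(column_type);

    /* The diagonal is described as length-1 blocks at explicit element
     * displacements from the start of the matrix. */
    int len = N - (diag >= 0 ? diag : -diag);
    int displs[N];
    for (int i = 0; i < len; i++) {
        int row = (diag >= 0) ? i : i - diag;   /* -diag > 0 shifts rows down */
        int col = (diag >= 0) ? i + diag : i;
        displs[i] = row * N + col;
    }
    MPI_Type_create_indexed_block(len, 1, displs, MPI_DOUBLE, diag_type);
    MPI_Type_commit(diag_type);
}
```

Packing either type walks non-contiguous 8-byte elements, which is exactly the case the fixed-size memcpy() dispatch above targets.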

In this graph, you will also notice a third series of measurements (ompi new_assign): it corresponds to @ggouaillardet's proposal in #6617 to change some memcpy() operations into assignments.

bosilca commented Jun 21, 2019

The underlying issue is now addressed in #6695.

bosilca closed this as completed Jun 21, 2019