Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osc/rdma: modify attach to check for region overlap #7387

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Feb 12, 2020

This commit addresses two issues in osc/rdma:

  1. It is erroneous to attach regions that overlap. This was being
    allowed but the standard does not allow overlapping attachments.

  2. Overlapping registration regions (4k alignment of attachments)
    appear to be allowed. Add attachment bases to the bookeeping
    structure so we can keep better track of what can be detached.

It is possible that the standard did not intend to allow #2. If that
is the case then #2 should fail in the same way as #1. There should
be no technical reason to disallow #2 at this time.

References #7384

Signed-off-by: Nathan Hjelm hjelmn@google.com

@hjelmn
Copy link
Member Author

hjelmn commented Feb 12, 2020

@DmitryLyakh should fix the issue. Keep in mind to run the app you will have to increase the maximum number of attachments. The default was 32 (now 64) but in order to run the app from #7384 you need to bump that to 128 (--mca osc_rdma_max_attach 128). You make want to modify the code to use fewer attachments as memory registrations can be a limited resource. As an example Cray Aries allows ~ 2000 registered regions per node. A portion of these are needed for two-sided communication, buffers, etc so chewing up 128 per process for attachments could be a problem.

Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the dynamic attachment doc again, I agree that MPI-3.1 appears to allow (2). I guess that makes sense, but also makes my head hurt.

@@ -218,7 +219,7 @@ static int ompi_osc_rdma_component_register (void)
MCA_BASE_VAR_SCOPE_LOCAL, &mca_osc_rdma_component.buffer_size);
free(description_str);

mca_osc_rdma_component.max_attach = 32;
mca_osc_rdma_component.max_attach = 64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call out this change in the commit log? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably break it out into a different commit. I realized that 32, while safer for Cray systems, is probably too low for most use cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@hjelmn hjelmn force-pushed the osc_rdma_allow_overlapping_registration_regions_and_return_the_correct_error_code_when_regions_overlap branch from ec331c7 to 96ed630 Compare February 14, 2020 03:56
This commit addresses two issues in osc/rdma:

 1) It is erroneous to attach regions that overlap. This was being
    allowed but the standard does not allow overlapping attachments.

 2) Overlapping registration regions (4k alignment of attachments)
    appear to be allowed. Add attachment bases to the bookeeping
    structure so we can keep better track of what can be detached.

It is possible that the standard did not intend to allow #2. If that
is the case then #2 should fail in the same way as #1. There should
be no technical reason to disallow #2 at this time.

References open-mpi#7384

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This commit increaes the osc_rdma_max_attach variable from 32
to 64. The new default is kept low due to the small number
of registration resources on some systems (Cray Aries). A
larger max attachement value can be set by the user on other
systems.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
@hjelmn hjelmn force-pushed the osc_rdma_allow_overlapping_registration_regions_and_return_the_correct_error_code_when_regions_overlap branch from 96ed630 to 54c8233 Compare February 17, 2020 01:15
@hjelmn hjelmn merged commit eeb3d7f into open-mpi:master Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants