v2.1 mpool/rcache rewrite #2101
This reverts commit 3912271.
This reverts commit 3e4ff53.
This reverts commit 756764c.
This reverts commit 6c95223.
This reverts commit 78f4315.
This reverts commit 48072aa.
This means we no longer need to check for jemalloc in the configure script for this component, so the check is removed. On some machines, having the TLS option enabled can cause errors when opening this component; pass --disable-tls when configuring jemalloc. See the CONTRIBUTING file at github.com/memkind/memkindw for instructions on installing jemalloc as a static library linked directly into memkind.
This commit rewrites both the mpool and rcache frameworks. Summary of changes:
- Before this change a significant portion of the rcache functionality lived in mpool components. This meant that it was impossible to add a new memory pool to use with RDMA networks (ugni, openib, etc) without duplicating the functionality of an existing mpool component. All the registration functionality has been removed from the mpool and placed in the rcache framework.
- All registration cache mpool components (udreg, grdma, gpusm, rgpusm) have been changed to rcache components. rcaches are allocated and released in the same way mpool components were.
- It is now valid to pass NULL as the resources argument when creating an rcache. At this time the gpusm and rgpusm components support this. All other rcache components require non-NULL resources.
- A new mpool component has been added: hugepage. This component supports huge page allocations on Linux.
- Memory pools are now allocated using "hints". Each mpool component is queried with the hints and returns a priority. The current hints supported are NULL (uses posix_memalign/malloc), page_size=x (huge page mpool), and mpool=x.
- The sm mpool has been moved to common/sm. This reflects that the sm mpool is specialized and not meant for any general allocations. This mpool may be moved back into the mpool framework if there is any objection.
- The opal_free_list_init arguments have been updated. The unused0 argument is now used to pass in the registration cache module. The mpool registration flags are now rcache registration flags.
- All components have been updated to make use of the new framework interfaces.
As this commit makes significant changes to both the mpool and rcache frameworks both versions have been bumped to 3.0.0.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
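For readers unfamiliar with the hint mechanism described in the commit message above, here is a minimal sketch of priority-based selection. It is not the Open MPI implementation: the `toy_mpool_component_t` type, the `toy_mpool_select` function, the priority values, and the substring matching are all assumptions made for illustration. Only the hint spellings (NULL, `page_size=x`, `mpool=x`) come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical component descriptor; NOT the real mca_mpool component type. */
typedef struct {
    const char *name;
    /* Returns a priority for the hints, or -1 if the component cannot serve them. */
    int (*query)(const char *hints);
} toy_mpool_component_t;

/* Assumed behaviour: the malloc-backed default pool only answers the NULL/empty hint. */
static int toy_default_query(const char *hints)
{
    return (hints == NULL || hints[0] == '\0') ? 10 : -1;
}

/* Assumed behaviour: the hugepage pool answers when asked for by name or by page size. */
static int toy_hugepage_query(const char *hints)
{
    if (hints == NULL) {
        return -1;
    }
    if (strstr(hints, "mpool=hugepage") != NULL) {
        return 100;                 /* explicitly requested by name */
    }
    if (strstr(hints, "page_size=") != NULL) {
        return 50;                  /* a page size was requested */
    }
    return -1;
}

static const toy_mpool_component_t toy_components[] = {
    {"default", toy_default_query},
    {"hugepage", toy_hugepage_query},
};

/* Query every component with the hints and return the name of the one
 * reporting the highest priority, or NULL if none can serve the hints. */
const char *toy_mpool_select(const char *hints)
{
    const char *best = NULL;
    int best_priority = -1;
    size_t count = sizeof(toy_components) / sizeof(toy_components[0]);

    for (size_t i = 0; i < count; ++i) {
        assert(toy_components[i].query != NULL);
        int priority = toy_components[i].query(hints);
        if (priority > best_priority) {
            best_priority = priority;
            best = toy_components[i].name;
        }
    }
    return best;
}
```

In this sketch, `toy_mpool_select(NULL)` picks the default pool and `toy_mpool_select("page_size=2M")` picks the hugepage pool; a hint that no component recognizes yields NULL.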
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This typo was originally fixed on the mpool_rewrite branch but the change was lost. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1356358: Null pointer dereferences (REVERSE_INULL): flist->fl_mpool can no longer be NULL. Removed the conditional.
Fix CID 1356357: Resource leaks (RESOURCE_LEAK): Added the call to free the hints array.
Fix CID 1356356: Resource leaks (RESOURCE_LEAK): This is a false error but it is safe to call close (-1) so just always call close.
Fix CID 1356354: Control flow issues (MISSING_BREAK):
Fix CID 1356353: Control flow issues (MISSING_BREAK): Add comments that indicate the fall-through is intentional.
Fix CID 1356351: Null pointer dereferences (FORWARD_NULL): Fix potential SEGV if the page_size key is malformed.
Fix CID 1356350: Error handling issues (CHECKED_RETURN): Add (void) to indicate that we do not care about the return code of sscanf in this case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
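Three of the Coverity fixes above (the malformed page_size key, the ignored sscanf return code, and the intentional fall-through) can be illustrated together. The following is a sketch under assumptions, not the code from the commit: the helper name `toy_parse_page_size`, the `K`/`M`/`G` suffix handling, and the choice to return 0 for malformed input are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "page_size=<n>[K|M|G]" into a byte count.
 * Returns 0 when the hint is missing or malformed. */
unsigned long toy_parse_page_size(const char *hint)
{
    unsigned long value = 0;
    char suffix = '\0';
    const char *eq;

    if (hint == NULL || strncmp(hint, "page_size", 9) != 0) {
        return 0;
    }

    eq = strchr(hint, '=');
    if (eq == NULL) {
        /* Malformed key with no value: bail out instead of dereferencing NULL. */
        return 0;
    }

    /* value and suffix keep their defaults if the conversion fails, so the
     * return code is deliberately ignored; (void) documents that intent. */
    (void) sscanf(eq + 1, "%lu%c", &value, &suffix);

    switch (suffix) {
    case 'G': case 'g':
        value *= 1024;
        /* fall through */
    case 'M': case 'm':
        value *= 1024;
        /* fall through */
    case 'K': case 'k':
        value *= 1024;
        break;
    default:
        break;
    }

    return value;
}
```

The explicit `/* fall through */` comments serve the same purpose as in the commit: they tell both the reader and the static analyzer that the missing `break` is deliberate.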
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix CID 1345825 (1 of 1): Dereference before null check (REVERSE_INULL): ib_proc should not be NULL in this case. Removed the check and added a check for NULL after OBJ_NEW.
CID 1269821 (1 of 1): Dereference null return value (NULL_RETURNS): I labeled this one as a false positive (which it is) but the code in question could stand to be cleaned up.
Fix CID 1356424 (1 of 1): Argument cannot be negative (NEGATIVE_RETURNS): While trying to silence another Coverity issue another was flagged. Protect the close of fd with if (fd >= 0).
CID 70772 (1 of 1): Dereference null return value (NULL_RETURNS):
CID 70773 (1 of 1): Dereference null return value (NULL_RETURNS):
CID 70774 (1 of 1): Dereference null return value (NULL_RETURNS): None of these are errors and are intentional but now that we have a list release function use that to make these go away. The cleanup is similar to CID 1269821.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit brings the scif btl up to date with changes made on master to rework the mpool and rcache frameworks. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes bugs that caused hangs or crashes when running out of registration resources. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Thanks Siegmar Gross for the report.
Fixes open-mpi#1545 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several bugs in the registration cache code:
- Fix a programming error in the grdma invalidation function that can cause an infinite loop if more than 100 registrations are associated with a munmapped region. This happens because the mca_rcache_base_vma_find_all function returns the same 100 registrations on each call. This has been fixed by adding an iterate function to the vma tree interface.
- Always obtain the vma lock when needed. This is required because there may be other threads in the system even if opal_using_threads() is false. Additionally, since it is safe to do so (the vma lock is recursive) the vma interface has been made thread safe.
- Avoid calling free() while holding a lock. This avoids race conditions with locks held outside the Open MPI code.
Fixes open-mpi#1654.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
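The first and third fixes above can be sketched together: visit every overlapping registration exactly once through a callback instead of fetching a fixed-size batch, and defer free() until the lock has been released. This is only an illustration under assumptions. The `toy_*` names, the linked list standing in for the vma tree, and the plain (non-recursive) pthread mutex are all hypothetical and do not reflect the actual rcache code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical registration record; NOT the real rcache registration type. */
typedef struct toy_reg {
    uintptr_t base, bound;          /* inclusive address range */
    struct toy_reg *next;
} toy_reg_t;

typedef struct {
    pthread_mutex_t lock;           /* plain mutex here; the source says the vma lock is recursive */
    toy_reg_t *head;
} toy_cache_t;

/* Callback type: return nonzero to have the visited registration unlinked. */
typedef int (*toy_iter_fn)(toy_reg_t *reg, void *ctx);

void toy_cache_init(toy_cache_t *cache)
{
    pthread_mutex_init(&cache->lock, NULL);
    cache->head = NULL;
}

int toy_insert(toy_cache_t *cache, uintptr_t base, uintptr_t bound)
{
    toy_reg_t *reg = malloc(sizeof(*reg));   /* allocate before taking the lock */
    if (reg == NULL) {
        return -1;
    }
    reg->base = base;
    reg->bound = bound;
    pthread_mutex_lock(&cache->lock);
    reg->next = cache->head;
    cache->head = reg;
    pthread_mutex_unlock(&cache->lock);
    return 0;
}

/* Visit each registration overlapping [base, bound] exactly once. Entries the
 * callback asks to remove are unlinked and pushed onto *removed, not freed. */
static void toy_iterate(toy_cache_t *cache, uintptr_t base, uintptr_t bound,
                        toy_iter_fn fn, void *ctx, toy_reg_t **removed)
{
    toy_reg_t **link = &cache->head;
    while (*link != NULL) {
        toy_reg_t *reg = *link;
        int overlaps = (reg->base <= bound) && (base <= reg->bound);
        if (overlaps && fn(reg, ctx)) {
            *link = reg->next;
            reg->next = *removed;
            *removed = reg;
        } else {
            link = &reg->next;
        }
    }
}

static int toy_count_and_remove(toy_reg_t *reg, void *ctx)
{
    (void) reg;
    ++*(size_t *) ctx;
    return 1;
}

/* Invalidate every registration overlapping [base, bound]; returns the count. */
size_t toy_invalidate(toy_cache_t *cache, uintptr_t base, uintptr_t bound)
{
    size_t count = 0;
    toy_reg_t *garbage = NULL;

    pthread_mutex_lock(&cache->lock);
    toy_iterate(cache, base, bound, toy_count_and_remove, &count, &garbage);
    pthread_mutex_unlock(&cache->lock);

    /* free() runs only after the lock is dropped. */
    while (garbage != NULL) {
        toy_reg_t *next = garbage->next;
        free(garbage);
        garbage = next;
    }
    return count;
}
```

Because the iteration removes entries as it walks, a region with more than 100 registrations is handled in a single pass rather than by repeatedly fetching a batch that never changes.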
Fixes open-mpi#1702 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On Cray, PR open-mpi#1846 introduced a double free situation which led to all kinds of random memory corruption problems. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Looks like I missed one.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per discussion on the call today, we encourage the community to check out this PR:
If this PR breaks things, it's likely because it's inadvertently missing some master commits. This PR is the first in a two-step process: the next will be to bring over the MCA flag enumerator stuff, which will then enable a small number of other PRs to come over.
I have run a bunch of tests and this seems ok to me. 👍
This PR reverts several spot fixes made to the v2.0.x release series then applies the mpool rewrite and the associated fixes. This PR should be complete but will need some testing to make sure I didn't miss anything.