oshmem/shmem: Allocate and exchange base segment address beforehand #12889
Signed-off-by: Thomas Vegas <tvegas@nvidia.com>
071169b to 019badb
#endif
    }

    if (mca_sshmem_base_start_address != memheap_mmap_get(
This relies on mmap() behavior where it always creates the VMA at the hint position if possible. If that is not always true (kernel versions...), this could regress existing behavior and even fail to honor the command-line parameter.
Should we remove that confirmation check and proceed regardless? Or maybe only skip the check when the address was passed on the command line?
@@ -126,6 +126,7 @@ segment_create(map_segment_t *ds_buf,
    /* init the contents of map_segment_t */
    shmem_ds_reset(ds_buf);

    (void)munmap(mca_sshmem_base_start_address, size);
probably not needed
why added then?
We now "reserve" that area by holding an mmap() on it, since there seems to be no randomization between an mmap/munmap + mmap sequence, and the area could be consumed by an unrelated mmap() in between.
Then in the modules we "overwrite" it (with ucp_mem_map() / mmap() / shmat()). The munmap() here is an attempt to make that explicit, although it opens a race, and mmap() replaces the area anyway with MAP_FIXED.
Will remove; need to check that shmat() overwrites an existing area too.
removed for mmap module, kept for sysv module as it is needed
Seems like negotiation is not done by default, as the default value of sshmem_base_start_address remains the same.
    rc = oshmem_shmem_allgather(&ptr, bases, sizeof(ptr));
    if (OSHMEM_SUCCESS != rc) {
        MEMHEAP_ERROR("Failed to exchange selected vma for base segment "
                      "(error %d)", rc);
        goto out;
    }
maybe we can also introduce an option without fallback to the original behavior? Then allgatherv will not be needed.
ok, in that case we could depend on the mca_sshmem_base_start_address value:
1- if 0: bcast the pointer value; any rank unable to create the mapping there fails on its side, which is a global failure
2- if UINTPTR_MAX: bcast the pointer value, then allgather so that all ranks fall back on the default value if any of them could not map it
default could be point 2-
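The two modes above can be modeled in a single process, without any collectives, as the following C sketch. resolve_base() and all of its parameters are hypothetical and only illustrate the decision; the real exchange would go through the broadcast and allgather calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical single-process model of the proposal (not PR code).
 *   start_addr: configured start address; UINTPTR_MAX selects negotiation
 *               with fallback, any other value is treated as strict.
 *   root_base:  address rank 0 obtained and broadcast.
 *   mapped[i]:  address rank i actually obtained (what allgather collects).
 *   fallback:   default address used when negotiation fails.
 * Returns 0 and sets *out on success, -1 for a global failure.
 */
static int resolve_base(uintptr_t start_addr, uintptr_t root_base,
                        const uintptr_t *mapped, size_t nranks,
                        uintptr_t fallback, uintptr_t *out)
{
    assert(nranks > 0);
    int all_match = 1;
    for (size_t i = 0; i < nranks; i++) {
        if (mapped[i] != root_base) {
            all_match = 0;
            break;
        }
    }
    if (all_match) {
        *out = root_base;      /* every rank mapped the broadcast address */
        return 0;
    }
    if (start_addr == UINTPTR_MAX) {
        *out = fallback;       /* mode 2: everyone falls back together */
        return 0;
    }
    return -1;                 /* mode 1: a mismatch is fatal */
}
```

In mode 1 the real code would not need the gathered array at all, since each rank can fail locally; the array is kept here only so both modes fit in one function.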
fixed
        base = ptr;
    }

    rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
@brminich, I tried the patch below, where all ranks do the mmap() themselves. The address returned by mmap() is randomized, as shown below, so we need some form of synchronization of the base address.
memheap_exchange_base_address() #1: exchange base address: base 0x7fa7d9dff000: ok
memheap_exchange_base_address() #3: exchange base address: base 0x7fdc5a15b000: ok
memheap_exchange_base_address() #2: exchange base address: base 0x7fe8aa56a000: ok
memheap_exchange_base_address() #0: exchange base address: base 0x7f3d1736b000: ok
diff --git a/oshmem/mca/memheap/base/memheap_base_select.c b/oshmem/mca/memheap/base/memheap_base_select.c
index 0ec74de6aa..0b0cfe4bee 100644
--- a/oshmem/mca/memheap/base/memheap_base_select.c
+++ b/oshmem/mca/memheap/base/memheap_base_select.c
@@ -134,21 +134,8 @@ static int memheap_exchange_base_address(size_t size, void **address)
return OSHMEM_ERROR;
}
- if (oshmem_my_proc_id() == 0) {
- ptr = memheap_mmap_get(NULL, size);
- base = ptr;
- }
-
- rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
- if (OSHMEM_SUCCESS != rc) {
- MEMHEAP_ERROR("Failed to exchange allocated vma for base segment "
- "(error %d)", rc);
- goto out;
- }
-
- if (oshmem_my_proc_id() != 0) {
- ptr = memheap_mmap_get(base, size);
- }
+ ptr = memheap_mmap_get(NULL, size);
+ base = ptr;
MEMHEAP_VERBOSE(100, "#%d: exchange base address: base %p: %s",
oshmem_my_proc_id(), base,
I do not understand that comment, since the new default address is ~0 and rank 0 allocates and broadcasts the pointer value, but ack, it is not a full negotiation.
    } else if (ptr != base) {
        /* Any failure terminates the rank and others start teardown */
        rc = OSHMEM_ERROR;
    }
maybe use this as default flow (i mean setting mca_sshmem_base_start_address = NULL by default)
fixed
What
Each process has an _end that depends on how the program was built. Try negotiation first, assuming that a symmetric layout will lead to the same available memory areas. If not all ranks can create the mapping at the same position, fall back on the current hardcoded method.
We need to keep the mmap() as a reservation in all cases, so that intermediate library calls do not consume the area in between. If that happens, the UCX module overrides it, causing corruption later.

Tested
-mca sshmem_base_start_address 0xffffffffffffffff or no option: negotiation takes place, mmap reservation
-mca sshmem_base_start_address 0x7f.....: no negotiation, mmap reservation, failure to allocate is detected
Static segment creation always skips the module-created segment. Segments found in /proc/self/maps are always larger than or equal to the module-allocated one.

Misc
Configure: ./configure --prefix=rfs --enable-debug --with-ucx=rfs
Options: -mca memheap_base_verbose 100, -mca sshmem sysv/mmap/ucx