a problem in the btl/sm shared_mem_btl_rndv.HOSTNAME #1230
Should the test be ... ? That being said, I am a bit puzzled by this. This solves the case in which local task 0 created the file but did not write to it yet. Assuming we run into this case, it means a task > 0 could try to open the file before it was created by task 0, and hence fail... I would suspect a higher level ensures we do not run into this in the first place. BTW, which filesystem is used by the sm btl?
Thanks for the comment. The file which is opened in sm_segment_attach is MCA_BTL_SM_RNDV_MOD_SM. And I agree to test as follows:
I think that a task > 0 does not open the file before local task 0 creates it.
The sm btl uses /dev/shm; it is not a network filesystem.
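For illustration, a size check of the kind being discussed might look roughly like the sketch below. This is a hypothetical helper (the function name and error handling are mine, not the actual patch); it only shows the idea of rejecting a rendezvous file whose full opal_shmem_ds_t has not yet been written:

```c
#include <unistd.h>

/* Hypothetical sketch: accept the rendezvous file only if the expected
 * number of bytes (e.g. sizeof(opal_shmem_ds_t)) could be read in full.
 * A short read means the node master has not finished writing yet. */
static int read_full_descriptor(int fd, void *buf, size_t expected)
{
    ssize_t nread = read(fd, buf, expected);
    if (nread < 0) {
        return -1;              /* read error */
    }
    if ((size_t)nread != expected) {
        return -1;              /* short read: treat as "not ready" */
    }
    return 0;                   /* full descriptor available */
}
```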
@sekifjikkatsu Thanks for pointing out 2); now this makes sense to me, and there is clearly a race condition. I can see three options.
Any preference? @hjelmn @bosilca any thoughts? (according to ...)
(...I'm only loosely reading this issue...) It sounds like there's an assumption in the >0 processes that the =0 process will atomically open and write all N bytes at once -- which may not be true (and therefore there's a race condition). Would it be possible to have the =0 process open a different file, write the contents, and then rename(2) it into place?
@jsquyres Using rename(2) is a simpler and better variant of my third option.
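A minimal sketch of the write-then-rename pattern being proposed, assuming a POSIX filesystem where rename(2) is atomic; the function and path names here are placeholders for illustration, not the actual Open MPI code:

```c
#include <stdio.h>
#include <unistd.h>

/* Sketch: write the whole rendezvous descriptor to a temporary file,
 * then rename(2) it into place.  rename() is atomic within a filesystem,
 * so readers either see no file at all or see the complete contents. */
static int publish_rendezvous(const char *final_path, const void *data, size_t len)
{
    char tmp_path[4096];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

    FILE *fp = fopen(tmp_path, "wb");
    if (NULL == fp) {
        return -1;
    }
    size_t written = fwrite(data, 1, len, fp);
    int close_err = fclose(fp);
    if (written != len || 0 != close_err) {
        unlink(tmp_path);
        return -1;
    }
    /* atomic publication: no reader can observe a partially written file */
    if (0 != rename(tmp_path, final_path)) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The key property is that the final path only ever comes into existence with its full contents, which removes the window in which a task > 0 could read 0 bytes.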
There are 2 logical steps here, separated by a fence (the modex operation). On each node the local master creates and initializes the shared memory files in the function create_rndv_file. This happens before the call to the modex, so there are no possible race conditions, as the other processes are already blocked in the modex exchange (PMIx fence in ompi_mpi_init.c:644). Once the local master joins the modex, everybody completes and all remaining local processes continue with attaching to the shared file.
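A rough sketch of the ordering described above, with placeholder function names (this is not the actual Open MPI call chain, just the shape of the argument):

```c
/* Placeholder declarations standing in for the real routines. */
extern void create_rndv_file(void);   /* node master: create + fully write the file */
extern void modex_fence(void);        /* PMIx fence in ompi_mpi_init                */
extern void sm_segment_attach(void);  /* non-masters: open + read the file          */

static void sm_btl_setup(int local_rank)
{
    if (0 == local_rank) {
        create_rndv_file();    /* step 1: only the local master writes */
    }
    modex_fence();             /* everyone blocks here until the master is done */
    if (0 != local_rank) {
        sm_segment_attach();   /* step 2: safe, the file is complete by now */
    }
}
```

Without the fence (the async modex case discussed below), step 2 can start before step 1 has finished, which is exactly the window the rename approach closes.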
What happens when we set mpi_add_procs_cutoff=0 (or exceed the cutoff point)? Under that condition, we don't execute a fence in mpi_init - we just fall through. Do we need something to avoid the race condition in that case?
If we don't have a fence in MPI_Init, then we do need to protect the creation of the shared file by other means. However, even when I set ompi_mpi_add_procs_cutoff to 0 I still see the fence at ompi_mpi_init.c:644.
My bad - the cutoff is for dynamic add procs. The correct MCA param is pmix_base_async_modex=1.
I'm shifting this to the 2.0.0 release, as 1.10 does not have the async modex feature.
@sekifjikkatsu which Open MPI version are you using? @bosilca if I understand your first reply correctly, this issue cannot happen in v1.10; in the case of async modex, and if I still understand correctly, we need the ...
@ggouaillardet let's try to keep the code in sync between 2.x and 1.10. To protect against the missing local barrier, the rename method proposed by @jsquyres seems like a good solution. We could also replicate a local barrier by using another temporary file to synchronize the local processes.
write to file and then rename, so when the file is open for read, its content is known to have been written. Fixes #1230
write to file and then rename, so when the file is open for read, its content is known to have been written. Fixes open-mpi/ompi#1230 (cherry picked from commit open-mpi/ompi@db4f483)
I usually use the Open MPI 1.8.4 version. Some users' programs run into this issue. And I tried the master version once; does the master version run OPAL_MODEX in ompi_mpi_init? Waiting via PMIx(?)
That is really puzzling...
@sekifjikkatsu I still cannot make any sense of this.
Bottom line: because of the modex in v1.8 and v1.10, I cannot see a race condition. That being said, and strictly speaking, a ...
Do you get error messages such as ...?
write to file and then rename, so when the file is open for read, its content is known to have been written. Fixes open-mpi/ompi#1230 (cherry picked from commit open-mpi/ompi@db4f483)
write to file and then rename, so when the file is open for read, its content is known to have been written. Fixes open-mpi/ompi#1230 (back-ported from commit open-mpi/ompi@db4f483)
I got the error ...
Under my environment, ... Thanks for your correction patch. @ggouaillardet, I'm sorry that I confused you.
@sekifjikkatsu OK, now that makes sense to me, and I am glad the patch solved your issue!
Per #1230, add a comment explaining why we write to a temporary file and then rename(2) the file, just so that future code maintainers don't wonder why we do this seemingly-useless step.
Per open-mpi/ompi#1230, add a comment explaining why we write to a temporary file and then rename(2) the file, just so that future code maintainers don't wonder why we do this seemingly-useless step. (cherry picked from commit 6d073a8)
I think there is a problem in the btl/sm component.
It is a synchronization problem between the node master process and the other child processes.
The other child processes read 0 bytes from the shared_mem_btl_rndv.HOSTNAME file before the node master process writes sizeof(opal_shmem_ds_t) bytes to the file.
I think this problem can be solved with the following correction.
This problem occurs in Open MPI version 1.8, and it may not occur in Open MPI master.