
a problem in the btl/sm shared_mem_btl_rndv.HOSTNAME #1230

Closed
sekifjikkatsu opened this issue Dec 17, 2015 · 17 comments

@sekifjikkatsu

I think there is a problem in the btl/sm component: a synchronization problem between the node master process and the other child processes. The child processes can read 0 bytes from the shared_mem_btl_rndv.HOSTNAME file before the node master process has written sizeof(opal_shmem_ds_t) bytes to it. I think this problem can be solved with the following correction.

Index: ompi/mca/btl/sm/btl_sm.c
===================================================================
--- ompi/mca/btl/sm/btl_sm.c    (revision 4013)
+++ ompi/mca/btl/sm/btl_sm.c    (working copy)
@@ -180,6 +180,7 @@
 static int
 sm_segment_attach(mca_btl_sm_component_t *comp_ptr)
 {
+    struct stat buf;
     int rc = OMPI_SUCCESS;
     int fd = -1;
     ssize_t bread = 0;
@@ -195,6 +196,14 @@
         rc = OMPI_ERR_IN_ERRNO;
         goto out;
     }
+    do {
+        if (0 != fstat(fd,&buf)) {
+            opal_output(0, "sm_segment_attach: "
+                           "fstat errno=%d\n",errno);
+            rc = OMPI_ERROR;
+            goto out;
+        }
+    } while (sizeof(opal_shmem_ds_t) != buf.st_size);
     if ((ssize_t)sizeof(opal_shmem_ds_t) != (bread =
         read(fd, tmp_shmem_ds, sizeof(opal_shmem_ds_t)))) {
         opal_output(0, "sm_segment_attach: "

This problem occurs in Open MPI version 1.8; it may not occur on Open MPI master.

@ggouaillardet
Contributor

Should the test be the following?

while (sizeof(opal_shmem_ds_t) > buf.st_size)

In create_rndv_file, if the type is MCA_BTL_SM_RNDV_MOD_MPOOL, some extra data is written into the file, so the size can exceed sizeof(opal_shmem_ds_t).

That being said, I am a bit puzzled by this.

This fix only covers the case in which local task 0 has created the file but not yet written to it. If we can run into that case, it means a task > 0 could also try to open the file before task 0 has even created it, and hence fail; I would expect a higher level to ensure we do not run into this in the first place.

By the way, which filesystem is used by the sm btl: a local filesystem, tmpfs, /dev/shm, or a network filesystem? If it is a network filesystem, did you get a warning about that (or did you configure Open MPI not to issue one)? The recommendation is not to use a network filesystem for this.

@sekifjikkatsu
Author

Thanks for the comment.

The file opened in sm_segment_attach is of type MCA_BTL_SM_RNDV_MOD_SM, so I think the size of the file is exactly sizeof(opal_shmem_ds_t).

That said, I agree with testing as follows:

while (sizeof(opal_shmem_ds_t) > buf.st_size)

I think a task > 0 does not open the file before local task 0 creates it: tasks > 0 wait for the file to become readable (R_OK) via access() in setup_mpool_base_resources:

     /* Wait for the file to be created */
    while (0 != access(comp_ptr->sm_rndv_file_name, R_OK)) {
        opal_progress();
    }

The sm btl uses /dev/shm; it is not a network filesystem.

@ggouaillardet
Contributor

@sekifjikkatsu Thanks for pointing out 2); now this makes sense to me, and there is clearly a race condition.

I can see three options:

  1. while ((sizeof(opal_shmem_ds_t) + out_res->size) > buf.st_size)
    /* I previously missed the second read, hence out_res->size; also, opal_progress() should be invoked while the size is too small */

  2. replace the access() test with stat() and also check the size

  3. create the file with a name specific to local task 0, write its content, and then hard-link it to the name that is used by local tasks > 0

Any preference?

@hjelmn @bosilca any thoughts? (According to owner.txt, UTK is the maintainer, but I am not sure who is maintaining this module.)
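
For illustration, a minimal sketch of what option 2 might look like (a hypothetical helper, not code that is in the tree; it assumes the POSIX stat() API and opal_progress()):

    #include <sys/stat.h>
    #include <errno.h>

    #include "opal/runtime/opal_progress.h"

    /* Hypothetical helper: wait until the rendezvous file exists *and* has
     * reached the expected size before the caller opens and reads it. */
    static int wait_for_rndv_file(const char *path, size_t expected_size)
    {
        struct stat sb;

        for (;;) {
            if (0 == stat(path, &sb)) {
                if ((off_t)expected_size <= sb.st_size) {
                    return 0;          /* file is present and fully written */
                }
            } else if (ENOENT != errno) {
                return -1;             /* unexpected error, give up */
            }
            opal_progress();           /* let the runtime make progress while polling */
        }
    }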

@jsquyres
Member

(...I'm only loosely reading this issue...)

It sounds like there's an assumption in the >0 processes that the =0 process will atomically open the file and write all N bytes at once -- which may not be true (and therefore there's a race condition).

Would it be possible to have the =0 process open a different file, write the contents, and then rename(2) the file to the name that the >0 processes are expecting? This would guarantee that, from the >0 processes' perspective, the file atomically appears with the correct contents.
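
A minimal sketch of that write-then-rename idea (hypothetical names and error handling; the actual fix lives in the sm btl setup code):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <limits.h>

    /* Hypothetical helper: local rank 0 writes the rendezvous data to a
     * temporary name, then rename(2) publishes it atomically, so readers
     * never observe a partially written file. */
    static int publish_rndv_file(const char *final_path,
                                 const void *data, size_t size)
    {
        char tmp_path[PATH_MAX];
        int fd;

        snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

        fd = open(tmp_path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) {
            return -1;
        }
        if ((ssize_t)size != write(fd, data, size)) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        close(fd);

        /* rename(2) is atomic within a filesystem: readers see either no file
         * or the complete file, never a truncated one. */
        return rename(tmp_path, final_path);
    }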

@ggouaillardet
Contributor

@jsquyres using rename(2) is a simpler and better variant of my third option.
I will make the PRs from Monday unless there are any objections.

@bosilca
Member

bosilca commented Dec 19, 2015

There are two logical steps here, separated by a fence (the modex operation). On each node, the local master creates and initializes the shared memory files in create_rndv_file. This happens before the call to the modex, so there is no possible race condition: the other processes are already blocked in the modex exchange (the PMIX fence in ompi_mpi_init.c:644). Once the local master joins the modex, everybody completes, and all remaining local processes continue with attaching to the shared file.

@rhc54
Contributor

rhc54 commented Dec 19, 2015

What happens when we set mpi_add_procs_cutoff=0 (or exceed the cutoff point)? Under that condition, we don't execute a fence in MPI_Init -- we just fall through. Do we need something to avoid the race condition in that case?

@bosilca
Member

bosilca commented Dec 19, 2015

If we don't have a fence in MPI_Init, then we do need to protect the creation of the shared file by other means. However, even when I set ompi_mpi_add_procs_cutoff to 0 I still see the fence at ompi_mpi_init.c:644.

@rhc54
Contributor

rhc54 commented Dec 19, 2015

My bad -- the cutoff is for dynamic add_procs. The correct MCA param is pmix_base_async_modex=1

@rhc54 rhc54 modified the milestones: v2.0.0, v1.10.2 Dec 19, 2015
@rhc54
Contributor

rhc54 commented Dec 19, 2015

I'm shifting this to the 2.0.0 release as 1.10 does not have the async modex feature.

@ggouaillardet
Contributor

@sekifjikkatsu which Open MPI version are you using?
Did you actually run into this issue, or do you suspect there is a race condition based on code review?
If you are using master or v2.x, do you use the async modex?

@bosilca if I understand your first reply correctly, this issue cannot happen in v1.10.
Is it then correct to go one step further and remove the while(access) loop?

In the case of the async modex, and if I still understand correctly, we need the while(access) loop, but it is not enough to avoid the race condition. So does the rename approach suggested by Jeff look like a good fit here?

@bosilca
Member

bosilca commented Dec 20, 2015

@ggouaillardet let's try to keep the code in sync between 2.x and 1.10. To protect against the missing local barrier, the rename method proposed by @jsquyres seems like a good solution. We could also replicate a local barrier by using another temporary file to synchronize the local processes.
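
A rough sketch of that temporary-file barrier idea (hypothetical helper names, reusing the access()/opal_progress() polling pattern already used in setup_mpool_base_resources):

    #include <fcntl.h>
    #include <unistd.h>

    #include "opal/runtime/opal_progress.h"

    /* Hypothetical local rank 0 side: create a sentinel file only after the
     * rendezvous file has been fully written and closed. */
    static int signal_rndv_ready(const char *sentinel_path)
    {
        int fd = open(sentinel_path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) {
            return -1;
        }
        close(fd);
        return 0;
    }

    /* Hypothetical local rank > 0 side: poll for the sentinel before reading
     * the rendezvous file. */
    static void wait_rndv_ready(const char *sentinel_path)
    {
        while (0 != access(sentinel_path, R_OK)) {
            opal_progress();   /* keep the runtime progressing while polling */
        }
    }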

ggouaillardet added a commit that referenced this issue Dec 21, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes #1230
This was referenced Dec 21, 2015
ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 21, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(cherry picked from commit open-mpi/ompi@db4f483)
@sekifjikkatsu
Author

I usually use Open MPI version 1.8.4.

Some users' programs run into this issue (it was not found via source code review).

I also tried the master version once, and the issue did not occur there.

Does the master version run OPAL_MODEX in ompi_mpi_init? The 1.8 version does not have the PMIX fence.

The processes appear to be waiting in PMIx:
(gdb) where
#0  0x00007f6ff03cea7d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f6ff0403c24 in usleep (useconds=) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x00007f6fedc2ffea in OPAL_PMIX_PMIX1XX_PMIx_Fence () from /home/seki/MPI/origrun/env/master/lib/libpmix.so.0
#3  0x00007f6fede55c27 in pmix1_fence () from /home/seki/MPI/origrun/env/master/lib/openmpi/mca_pmix_pmix1xx.so
#4  0x00007f6ff0918ac6 in ompi_mpi_init () from /home/seki/MPI/origrun/env/master/lib/libmpi.so.0
#5  0x00007f6ff0937b50 in PMPI_Init () from /home/seki/MPI/origrun/env/master/lib/libmpi.so.0
#6  0x0000000000400867 in main ()

@ggouaillardet
Contributor

That is really puzzling...
We concluded the only way to run into this issue was with master (or maybe v2.x) and the async modex (!)
I ran v1.10 under a debugger and there is definitely an "external" barrier, so I can only concur this cannot happen there.
I will try 1.8.4 tomorrow to confirm.
In the meantime, I pushed a fix to master and will PR it to v1.10 tomorrow.

@ggouaillardet
Contributor

@sekifjikkatsu I still cannot make any sense of this.
I ran 1.8.4 under the debugger and here is what happens in ompi_mpi_init():

  • at line 611, mca_pml_base_select() causes the local task(s) zero to create the files, write them, and close them
  • line 641 is the modex; this is basically a barrier involving all tasks, so when the barrier exits, the files have been created, written, and closed
  • at line 770, MCA_PML_CALL(add_procs()) causes all tasks to open the file read-only and read its content

Bottom line: because of the modex in v1.8 and v1.10, I cannot see a race condition.

That being said, strictly speaking, a read can legitimately return less than the requested length, but that is currently treated as an error.

Do you get error messages such as the following?

A system call failed during sm BTL initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

System call: write(2)
Error: 0

Do you also get error messages such as "Read inconsistency -- read:"?
If yes, is bread zero (I cannot see how that could happen) or non-zero (a legitimate short read that is not currently handled)?
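
As an aside on that short-read point, a minimal sketch of a read loop that tolerates legitimate short reads might look like this (a hypothetical helper, not what the sm btl currently does):

    #include <unistd.h>
    #include <errno.h>

    /* Hypothetical helper: keep reading until the full count has arrived,
     * EOF is hit, or a real error occurs. */
    static ssize_t read_full(int fd, void *buf, size_t count)
    {
        size_t total = 0;

        while (total < count) {
            ssize_t n = read(fd, (char *)buf + total, count - total);
            if (n > 0) {
                total += (size_t)n;     /* partial read: keep going */
            } else if (0 == n) {
                break;                  /* EOF before the full count arrived */
            } else if (EINTR != errno) {
                return -1;              /* real error */
            }
        }
        return (ssize_t)total;
    }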

ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 22, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(cherry picked from commit open-mpi/ompi@db4f483)
ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 22, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(back-ported from commit open-mpi/ompi@db4f483)
@sekifjikkatsu
Author

I got the error:

sm_segment_attach: Read inconsistency -- read: 0, but expected: 4136!

In my environment, the grpcomm component is a customized one, and its modex has no barrier. So I think the race condition occurs only in my environment.

Thanks for your correction patch. It solves the race condition and does not require a modex barrier.

@ggouaillardet, I'm sorry for the confusion.

@ggouaillardet
Contributor

@sekifjikkatsu OK, now that makes sense to me, and I am glad the patch solved your issue!

jsquyres added a commit that referenced this issue Jan 4, 2016
Per #1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
jsquyres added a commit to jsquyres/ompi-release that referenced this issue Jan 4, 2016
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.

(cherry picked from commit 6d073a8)
jsquyres added a commit to jsquyres/ompi that referenced this issue Aug 23, 2016
opal/asm: fix syntax of timer code for ia32