
a problem in the btl/sm shared_mem_btl_rndv.HOSTNAME #1230

Closed
sekifjikkatsu opened this issue Dec 17, 2015 · 17 comments

@sekifjikkatsu

I think there is a problem in the btl/sm component: a synchronization problem between the node master process and the other child processes. The child processes can read 0 bytes from the shared_mem_btl_rndv.HOSTNAME file before the node master process has written sizeof(opal_shmem_ds_t) bytes to it. I think this problem can be solved with the following correction.

Index: ompi/mca/btl/sm/btl_sm.c
===================================================================
--- ompi/mca/btl/sm/btl_sm.c    (revision 4013)
+++ ompi/mca/btl/sm/btl_sm.c    (working copy)
@@ -180,6 +180,7 @@
 static int
 sm_segment_attach(mca_btl_sm_component_t *comp_ptr)
 {
+    struct stat buf;
     int rc = OMPI_SUCCESS;
     int fd = -1;
     ssize_t bread = 0;
@@ -195,6 +196,14 @@
         rc = OMPI_ERR_IN_ERRNO;
         goto out;
     }
+    do {
+        if (0 != fstat(fd,&buf)) {
+            opal_output(0, "sm_segment_attach: "
+                           "fstat errno=%d\n",errno);
+            rc = OMPI_ERROR;
+            goto out;
+        }
+    } while (sizeof(opal_shmem_ds_t) != buf.st_size);
     if ((ssize_t)sizeof(opal_shmem_ds_t) != (bread =
         read(fd, tmp_shmem_ds, sizeof(opal_shmem_ds_t)))) {
         opal_output(0, "sm_segment_attach: "

This problem occurs in Open MPI version 1.8; it may not occur on Open MPI master.

@ggouaillardet
Contributor

Should the test be the following?

while (sizeof(opal_shmem_ds_t) > buf.st_size)

In create_rndv_file, if the type is MCA_BTL_SM_RNDV_MOD_MPOOL, some extra data is written into the file, so the size can exceed sizeof(opal_shmem_ds_t).

That being said, I am a bit puzzled by this.

This fix only covers the case in which local task 0 has created the file but not yet written to it. If we can run into that case, it means a task > 0 could also try to open the file before task 0 has even created it, and hence fail; I would expect a higher level to ensure we do not run into this in the first place.

By the way, which filesystem is used by the sm btl: a local filesystem, tmpfs, /dev/shm, or a network filesystem? If it is a network filesystem, did you get a warning about that (or did you configure Open MPI not to issue one)? The recommendation is not to use a network filesystem for this.

@sekifjikkatsu
Author

Thanks for the comment.

The file opened in sm_segment_attach is of type MCA_BTL_SM_RNDV_MOD_SM, so I think the size of the file is exactly sizeof(opal_shmem_ds_t).

That said, I agree with testing as follows:

while (sizeof(opal_shmem_ds_t) > buf.st_size)

I think a task > 0 does not open the file before local task 0 creates it: tasks > 0 wait for the file to become readable (R_OK) via access() in setup_mpool_base_resources:

     /* Wait for the file to be created */
    while (0 != access(comp_ptr->sm_rndv_file_name, R_OK)) {
        opal_progress();
    }

The sm btl uses /dev/shm; it is not a network filesystem.

@ggouaillardet
Contributor

@sekifjikkatsu Thanks for pointing out 2); now this makes sense to me, and there is clearly a race condition.

I can see three options:

  1. while ((sizeof(opal_shmem_ds_t) + out_res->size) > buf.st_size)
    /* I previously missed the second read, hence out_res->size; also, opal_progress() should be invoked while the size is too small */

  2. replace the access() test with stat() and also check the size

  3. create the file with a name specific to local task 0, write its content, and then hard-link it to the name that is used by local tasks > 0

Any preference?

@hjelmn @bosilca any thoughts? (According to owner.txt, UTK is the maintainer, but I am not sure who is maintaining this module.)
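
For illustration, a minimal sketch of what option 2 might look like (a hypothetical helper, not code that is in the tree; it assumes the POSIX stat() API and opal_progress()):

    #include <sys/stat.h>
    #include <errno.h>

    #include "opal/runtime/opal_progress.h"

    /* Hypothetical helper: wait until the rendezvous file exists *and* has
     * reached the expected size before the caller opens and reads it. */
    static int wait_for_rndv_file(const char *path, size_t expected_size)
    {
        struct stat sb;

        for (;;) {
            if (0 == stat(path, &sb)) {
                if ((off_t)expected_size <= sb.st_size) {
                    return 0;          /* file is present and fully written */
                }
            } else if (ENOENT != errno) {
                return -1;             /* unexpected error, give up */
            }
            opal_progress();           /* let the runtime make progress while polling */
        }
    }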

@jsquyres
Member

(...I'm only loosely reading this issue...)

It sounds like there's an assumption in the >0 processes that the =0 process will atomically open the file and write all N bytes at once -- which may not be true (and therefore there's a race condition).

Would it be possible to have the =0 process open a different file, write the contents, and then rename(2) the file to the name that the >0 processes are expecting? This would guarantee that, from the >0 processes' perspective, the file atomically appears with the correct contents.
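
A minimal sketch of that write-then-rename idea (hypothetical names and error handling; the actual fix lives in the sm btl setup code):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <limits.h>

    /* Hypothetical helper: local rank 0 writes the rendezvous data to a
     * temporary name, then rename(2) publishes it atomically, so readers
     * never observe a partially written file. */
    static int publish_rndv_file(const char *final_path,
                                 const void *data, size_t size)
    {
        char tmp_path[PATH_MAX];
        int fd;

        snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

        fd = open(tmp_path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) {
            return -1;
        }
        if ((ssize_t)size != write(fd, data, size)) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        close(fd);

        /* rename(2) is atomic within a filesystem: readers see either no file
         * or the complete file, never a truncated one. */
        return rename(tmp_path, final_path);
    }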

@ggouaillardet
Contributor

@jsquyres using rename(2) is a simpler and better variant of my third option.
I will make the PRs from Monday unless there are any objections.

@bosilca
Member

bosilca commented Dec 19, 2015

There are two logical steps here, separated by a fence (the modex operation). On each node, the local master creates and initializes the shared memory files in create_rndv_file. This happens before the call to the modex, so there is no possible race condition: the other processes are already blocked in the modex exchange (the PMIX fence in ompi_mpi_init.c:644). Once the local master joins the modex, everybody completes, and all remaining local processes continue with attaching to the shared file.

@rhc54
Contributor

rhc54 commented Dec 19, 2015

What happens when we set mpi_add_procs_cutoff=0 (or exceed the cutoff point)? Under that condition, we don't execute a fence in MPI_Init -- we just fall through. Do we need something to avoid the race condition in that case?

@bosilca
Member

bosilca commented Dec 19, 2015

If we don't have a fence in MPI_Init, then we do need to protect the creation of the shared file by other means. However, even when I set ompi_mpi_add_procs_cutoff to 0 I still see the fence at ompi_mpi_init.c:644.

@rhc54
Contributor

rhc54 commented Dec 19, 2015

My bad -- the cutoff is for dynamic add_procs. The correct MCA param is pmix_base_async_modex=1

@rhc54 rhc54 modified the milestones: v2.0.0, v1.10.2 Dec 19, 2015
@rhc54
Contributor

rhc54 commented Dec 19, 2015

I'm shifting this to the 2.0.0 release as 1.10 does not have the async modex feature.

@ggouaillardet
Contributor

@sekifjikkatsu which Open MPI version are you using?
Did you actually run into this issue, or do you suspect there is a race condition based on code review?
If you are using master or v2.x, do you use the async modex?

@bosilca if I understand your first reply correctly, this issue cannot happen in v1.10.
Is it then correct to go one step further and remove the while(access) loop?

In the case of the async modex, and if I still understand correctly, we need the while(access) loop, but it is not enough to avoid the race condition. So does the rename approach suggested by Jeff look like a good fit here?

@bosilca
Member

bosilca commented Dec 20, 2015

@ggouaillardet let's try to keep the code in sync between 2.x and 1.10. To protect against the missing local barrier, the rename method proposed by @jsquyres seems like a good solution. We could also replicate a local barrier by using another temporary file to synchronize the local processes.
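
A rough sketch of that temporary-file barrier idea (hypothetical helper names, reusing the access()/opal_progress() polling pattern already used in setup_mpool_base_resources):

    #include <fcntl.h>
    #include <unistd.h>

    #include "opal/runtime/opal_progress.h"

    /* Hypothetical local rank 0 side: create a sentinel file only after the
     * rendezvous file has been fully written and closed. */
    static int signal_rndv_ready(const char *sentinel_path)
    {
        int fd = open(sentinel_path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) {
            return -1;
        }
        close(fd);
        return 0;
    }

    /* Hypothetical local rank > 0 side: poll for the sentinel before reading
     * the rendezvous file. */
    static void wait_rndv_ready(const char *sentinel_path)
    {
        while (0 != access(sentinel_path, R_OK)) {
            opal_progress();   /* keep the runtime progressing while polling */
        }
    }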

ggouaillardet added a commit that referenced this issue Dec 21, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes #1230
This was referenced Dec 21, 2015
ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 21, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(cherry picked from commit open-mpi/ompi@db4f483)
@sekifjikkatsu
Author

I usually use Open MPI version 1.8.4.

Some users' programs run into this issue (it was not found via source code review).

I also tried the master version once, and the issue did not occur there.

Does the master version run OPAL_MODEX in ompi_mpi_init? The 1.8 version does not have the PMIX fence.

The processes appear to be waiting in PMIx:
(gdb) where
#0  0x00007f6ff03cea7d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f6ff0403c24 in usleep (useconds=) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x00007f6fedc2ffea in OPAL_PMIX_PMIX1XX_PMIx_Fence () from /home/seki/MPI/origrun/env/master/lib/libpmix.so.0
#3  0x00007f6fede55c27 in pmix1_fence () from /home/seki/MPI/origrun/env/master/lib/openmpi/mca_pmix_pmix1xx.so
#4  0x00007f6ff0918ac6 in ompi_mpi_init () from /home/seki/MPI/origrun/env/master/lib/libmpi.so.0
#5  0x00007f6ff0937b50 in PMPI_Init () from /home/seki/MPI/origrun/env/master/lib/libmpi.so.0
#6  0x0000000000400867 in main ()

@ggouaillardet
Contributor

That is really puzzling...
We concluded the only way to run into this issue was with master (or maybe v2.x) and the async modex (!)
I ran v1.10 under a debugger and there is definitely an "external" barrier, so I can only concur this cannot happen there.
I will try 1.8.4 tomorrow to confirm.
In the meantime, I pushed a fix to master and will PR it to v1.10 tomorrow.

@ggouaillardet
Contributor

@sekifjikkatsu I still cannot make any sense of this.
I ran 1.8.4 under the debugger and here is what happens in ompi_mpi_init():

  • at line 611, mca_pml_base_select() causes the local task(s) zero to create the files, write them, and close them
  • line 641 is the modex; this is basically a barrier involving all tasks, so when the barrier exits, the files have been created, written, and closed
  • at line 770, MCA_PML_CALL(add_procs()) causes all tasks to open the file read-only and read its content

Bottom line: because of the modex in v1.8 and v1.10, I cannot see a race condition.

That being said, strictly speaking, a read can legitimately return less than the requested length, but that is currently treated as an error.

Do you get error messages such as the following?

A system call failed during sm BTL initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

System call: write(2)
Error: 0

Do you also get error messages such as "Read inconsistency -- read:"?
If yes, is bread zero (I cannot see how that could happen) or non-zero (a legitimate short read that is not currently handled)?
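
As an aside on that short-read point, a minimal sketch of a read loop that tolerates legitimate short reads might look like this (a hypothetical helper, not what the sm btl currently does):

    #include <unistd.h>
    #include <errno.h>

    /* Hypothetical helper: keep reading until the full count has arrived,
     * EOF is hit, or a real error occurs. */
    static ssize_t read_full(int fd, void *buf, size_t count)
    {
        size_t total = 0;

        while (total < count) {
            ssize_t n = read(fd, (char *)buf + total, count - total);
            if (n > 0) {
                total += (size_t)n;     /* partial read: keep going */
            } else if (0 == n) {
                break;                  /* EOF before the full count arrived */
            } else if (EINTR != errno) {
                return -1;              /* real error */
            }
        }
        return (ssize_t)total;
    }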

ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 22, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(cherry picked from commit open-mpi/ompi@db4f483)
ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Dec 22, 2015
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230

(back-ported from commit open-mpi/ompi@db4f483)
@sekifjikkatsu
Author

I got the error:

sm_segment_attach: Read inconsistency -- read: 0, but expected: 4136!

In my environment, the grpcomm component is a customized one, and its modex has no barrier. So I think the race condition occurs only in my environment.

Thanks for your correction patch. It solves the race condition and does not require a modex barrier.

@ggouaillardet, I'm sorry for the confusion.

@ggouaillardet
Contributor

@sekifjikkatsu OK, now that makes sense to me, and I am glad the patch solved your issue!

jsquyres added a commit that referenced this issue Jan 4, 2016
Per #1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
jsquyres added a commit to jsquyres/ompi-release that referenced this issue Jan 4, 2016
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.

(cherry picked from commit 6d073a8)
jsquyres added a commit to jsquyres/ompi that referenced this issue Aug 23, 2016
opal/asm: fix syntax of timer code for ia32