Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add access() calls to ensure fopen(...,"r") doesn't segfault #4335

Closed
wants to merge 1 commit into from

Conversation

npe9
Copy link
Contributor

@npe9 npe9 commented Oct 13, 2017

One of the invariants of the fopen() function called with the "r"
option is that the file exist before the function is called. If
this is not the case behavior is undefined (in the case of glibc
on some Crays and Redhat machines this means you get a segfault
in fileno_unlocked()). This commit is a first attempt at ensuring
this invariant holds when calls to fopen() are made.

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

One of the invariants of the fopen() function called with the "r"
option is that the file exist before the function is called. If
this is not the case behavior is undefined (in the case of glibc
on some Crays and Redhat machines this means you get a segfault
in fileno_unlocked()). This commit is a first attempt at ensuring
this invariant holds when calls to fopen() are made.

Signed-off-by: Noah Evans <nevans@sandia.gov>
@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

This code needs some pretty thorough review, both for style and to make sure I haven't forgotten anything.

Also the calls to access() depend on unistd.h and I expect some open-mpi environments might not support it.

@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

It's also worth noting that I've left hwloc alone. How is hwloc merged from its individual repo to its main tree?

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2017

We generally take hwloc release tarballs. However, at the moment, we have an early copy of HWLOC v2.0 from their nightly tarball. Your best bet would be to fork the hwloc repo and generate a pull request against their master.

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2017

BTW: are you sure about that existence requirement for fopen? I've been scanning the man pages, and I don't see that - indeed, there is a clear statement that the call will fail with ENOENT set in errno. Where are you seeing this requirement?

@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

@rhc54 You're right if gnu followed the posix standard.

From fopen

[CX] [Option Start] The character 'b' shall have no effect, but is allowed for ISO C standard conformance. [Option End] Opening a file with read mode (r as the first character in the mode argument) shall fail if the file does not exist or cannot be read.

The gnu documentation however makes this assumption:

‘r’ Open an existing file for reading only.

This means that any time OpenMPI does an fopen(...,"r") of a file that may or may not exist you get undefined behavior.

I've seen segfaults all over the place because of this. In particular when trying to open configuration or help files that don't exist and when specialized parts of the /proc file system (e.g. /proc/net/psched) aren't implemented on a particular system.

I'm certainly open to a better solution if you can think of one. In the meantime I'll fix whatever bugs pop up in the automated testing (it looks like, minimum, I broke an NFS test).

@jsquyres
Copy link
Member

Huh; I've never seen fopen() fail that way, but I can't say I've ever gone looking for it, either.

I wonder if we should do what we've done in a few other portability scenarios:

  • Make an opal_fopen() function that basically wraps calling access() and fopen(). It guarantees to return either a valid file pointer or NULL (e.g., if access() fails, it returns NULL).
    • Be sure to include a good comment in opal_fopen() that describes the need for this wrapper, perhaps even pointing to this PR.
  • Strategically #define fopen opal_fopen so that the entire code base benefits from this wrapper function.

That way, we don't have to add calls to access() everywhere, but we still protect the entire code base properly.

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2017

Me too - I've never heard of nor encountered this before, which is why I'm suspicious. I like the suggestion from @jsquyres - not only is it cleaner, but it means we only have one place that static code analyzers are going to complain about. Both Klockworks and Coverity will object that this PR creates a race condition between checking for file existence and then accessing it as there is no guarantee that the file hasn't been removed in the interim.

This is why we studiously cleansed the code base of calls to access (or stat) prior to calling a file access function - we were getting peppered with defect reports, and some of our companies (ahem, mine) will not allow release of code that crosses certain defect density boundaries. I'm a little bothered that we are now going back to reinsert those calls - but if we do the wrapper, then it will only count as one defect, so we might get away with it.

@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

@rhc54 @jsquyres Here's a relevant crash if I move my mca-params.conf out of my home directory:

#0  0x00007ffff6330060 in fileno_unlocked () from /lib64/libc.so.6
#1  0x00007ffff71b2965 in fopen (path=<optimized out>, mode=<optimized out>) at src/syscall.c:60
#2  0x00007fff760d1c59 in pmix_util_keyval_parse (filename=0x6f8c90 "/home/nevans/.pmix/mca-params.conf", callback=0x7fff7615b982 <save_value>)
    at util/keyval_parse.c:75
#3  0x00007fff7615b96e in pmix_mca_base_parse_paramfile (paramfile=0x6f8c90 "/home/nevans/.pmix/mca-params.conf",
    list=0x7fff763a1240 <pmix_mca_base_var_file_values>) at pmix_mca_base_parse_paramfile.c:44
#4  0x00007fff7615e24e in read_files (file_list=0x6f84e0 "/home/nevans/.pmix/mca-params.conf:/home/nevans/etc/pmix-mca-params.conf",
    file_values=0x7fff763a1240 <pmix_mca_base_var_file_values>, sep=58 ':') at pmix_mca_base_var.c:1269
#5  0x00007fff7615cca7 in pmix_mca_base_var_cache_files (rel_path_search=false) at pmix_mca_base_var.c:549
#6  0x00007fff7615c2f7 in pmix_mca_base_var_init () at pmix_mca_base_var.c:279
#7  0x00007fff7615646d in pmix_rte_init (type=PMIX_PROC_SERVER, info=0x6d9da0, ninfo=4, cbfunc=0x0) at runtime/pmix_init.c:134
#8  0x00007fff76103cd1 in OPAL_MCA_PMIX2X_PMIx_server_init (module=0x7fff7639c1c0 <mymodule>, info=0x6d9da0, ninfo=4) at server/pmix_server.c:153
#9  0x00007fff760adb3b in pmix2x_server_init (module=0x7ffff7dd4c40 <pmix_server>, info=0x7fffffffce60) at pmix2x_server_south.c:144
#10 0x00007ffff7b4fba1 in pmix_server_init () at orted/pmix/pmix_server.c:287
#11 0x00007ffff510f943 in rte_init () at ess_hnp_module.c:657
#12 0x00007ffff7b10b7d in orte_init (pargc=0x7fffffffd06c, pargv=0x7fffffffd060, flags=4) at runtime/orte_init.c:273
#13 0x00007ffff7b3db02 in orte_daemon (argc=16, argv=0x7fffffffd438) at orted/orted_main.c:345
#14 0x00000000004008ec in main (argc=16, argv=0x7fffffffd438) at orted.c:60
(gdb) disassemble
Dump of assembler code for function fileno_unlocked:
=> 0x00007ffff6330060 <+0>:	testl  $0x2000,(%rdi)
   0x00007ffff6330066 <+6>:	je     0x7ffff6330078 <fileno_unlocked+24>
...
(gdb) regs
...
rdi            0x0	0
...

So fileno is trying to dereference null pointer, which I believe is the indirection on the fp->_flags in the following code.

int
__fileno (_IO_FILE *fp)
{
  CHECK_FILE (fp, EOF);

  if (!(fp->_flags & _IO_IS_FILEBUF) || _IO_fileno (fp) < 0)
    {
      __set_errno (EBADF);
      return -1;
    }

  return _IO_fileno (fp);
}

@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

I'm compiling a personal glibc to see if I can figure out what the bug is exactly.

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2017

Weird - understand, very few users even have a local mca-params.conf in their home directory. I never have had one myself. In 13+ years of production, I have never heard anyone report a segfault because the file didn't exist.

As I said, maybe you have found some weird configuration that results in problems. We can certainly add the proposed protection (though I'd follow the suggestion from @jsquyres) - we just need to understand that it introduces a race condition that will generate (perfectly valid) complaints from the code checkers.

Might even want to ad a configure option to turn this off to avoid the complaints - like I said, nobody has ever reported such a thing, so maybe it only needs to be there for some configurations.

@npe9
Copy link
Contributor Author

npe9 commented Oct 13, 2017

@rhc54 Let me dig a little deeper here. I'll close this pull request for the moment (You guys' solution is better anyway) and see if I can really understand what its problem is. I can use this code to test the environment code now which will let me make sure the earlier pull request works.

@npe9 npe9 closed this Oct 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants