CCP4 segfaults for lack of vsyscall page #1466

Closed
Oeffner opened this issue Dec 6, 2016 · 12 comments

@Oeffner

Oeffner commented Dec 6, 2016

Please use the following bug reporting template to help produce actionable and reproducible issues. Please try to ensure that the reproduction is minimal so that the team can go through more bugs!

  • A brief description
    I am working with scientific programs from CCP4 that all run fine on native Linux but crash on WSL. They are text-mode programs that do scientific number crunching, and all have been compiled with Intel Fortran using MKL and OpenMP. Due to copyright restrictions I can submit neither a binary nor the source code of any of the programs. So far I have also not been able to reproduce the crash with a small example of my own built with Intel's compiler.

  • Expected results
    When run without arguments these programs should print about 40 lines of help text to stdout.

  • Actual results (with terminal output if applicable)
    Instant termination with no output to stdout.
    I think the strace suggests that the problem is with the futex system call.

  • Your Windows build number
    10.0.14971

  • Steps / All commands required to reproduce the error from a brand new installation
    After installing CCP4 run any of the commands, shelxe, shelxc, shelxd, shelxl, shelxs, shelxt from a bash shell.

  • Strace of the failing command

root@Pauli:~/LinuxTest#
root@Pauli:~/LinuxTest# strace -ff shelxd
execve("/mnt/b/LinuxTest/ccp4-7.0/bin/shelxd", ["shelxd"], [/* 36 vars */]) = 0
uname({sys="Linux", node="Pauli", ...}) = 0
brk(0)                                  = 0x2773000
brk(0x2774110)                          = 0x2774110
arch_prctl(ARCH_SET_FS, 0x2773800)      = 0
set_tid_address(0x2773ad0)              = 1524
set_robust_list(0x2773ae0, 24)          = 0
futex(0x7fffe629bf6c, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fffe629bf6c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 2773800) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x1982830, [], SA_RESTORER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x1982760, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0
brk(0x2795110)                          = 0x2795110
brk(0x2796000)                          = 0x2796000
rt_sigaction(SIGFPE, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGILL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGABRT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], SA_RESTORER, 0x7f843b026cb0}, 8) = 0
rt_sigaction(SIGINT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], SA_RESTORER, 0x7f843b026cb0}, 8) = 0
futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0xffffffffff600000} ---
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0xffffffffff600000} ---
exit_group(174)                         = ?
+++ exited with 174 +++
root@Pauli:~/LinuxTest#
  • Required packages and commands to install
    Install the scientific software suite CCP4 + SHELX from http://www.ccp4.ac.uk/download/index.php#os=linux on WSL, about 750Gb. Uncompress the package ccp4-7.0-shelx-linux-x86_64.tar.bz2 and run the setup script.

See our contributing instructions for assistance.

@therealkenc
Collaborator

therealkenc commented Dec 6, 2016

Shows in the straces of #1134, #1027, #670 et al too. WSL does not have real-time support, generally. [edit] but it looks like the problem is different than those issues; death is segfault signal related, and probably not due to the FUTEX_CLOCK_REALTIME futex, which is just returning EAGAIN, same as on native.

@Oeffner
Author

Oeffner commented Dec 6, 2016

That may be correct. When these programs work, they start by printing the time of day to stdout. Running the programs through gdb seems to support this:

root@Pauli:~/LinuxTest#
root@Pauli:~/LinuxTest# gdb -q shelxd
Reading symbols from shelxd...done.
(gdb) run
Starting program: /mnt/b/LinuxTest/ccp4-7.0/bin/shelxd
warning: Error disabling address space randomisation: Success

Program received signal SIGSEGV, Segmentation fault.
0xffffffffff600000 in ?? ()
(gdb) bt
#0  0xffffffffff600000 in ?? ()
#1  0x00000000019affdd in gettimeofday () at ../sysdeps/unix/sysv/linux/x86_64/gettimeofday.S:37
#2  0x00000000019352f1 in __kmp_read_system_time ()
#3  0x000000000191c51f in __kmp_do_serial_initialize() ()
#4  0x0000000001916658 in __kmp_internal_begin ()
#5  0x00000000018fa142 in __kmpc_begin ()
#6  0x00000000004003b3 in MAIN__ ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 1675] will be killed.

Quit anyway? (y or n) y
root@Pauli:~/LinuxTest#

All of these programs are compiled as static binaries with the Intel compiler. If they are compiled with dynamic library loading instead, they run fine. They also run fine when compiled with gfortran. I'm a little confused as to why this is so. I guess different compilers have different ways of achieving the same functionality.
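
For readers following along: the address in frame #0 of the backtrace, 0xffffffffff600000, matches the fixed base of the legacy x86-64 vsyscall page, and frame #1 shows gettimeofday jumping there. A minimal Python sketch for decoding such a fault address is below. The helper name `describe_vsyscall_addr` is made up for illustration, and the slot offsets assume the standard Linux x86-64 vsyscall layout; none of this comes from the CCP4 binaries themselves.

```python
VSYSCALL_BASE = 0xFFFFFFFFFF600000  # fixed address of the legacy x86-64 vsyscall page
VSYSCALL_SIZE = 0x1000              # the page is 4 KiB
SLOT_SIZE = 0x400                   # each legacy entry point sits in a 1024-byte slot
SLOTS = ["gettimeofday", "time", "getcpu"]  # assumed standard slot order


def describe_vsyscall_addr(addr):
    """Return the legacy vsyscall entry that starts at addr, or None."""
    offset = addr - VSYSCALL_BASE
    if not 0 <= offset < VSYSCALL_SIZE:
        return None  # not inside the vsyscall page at all
    slot, remainder = divmod(offset, SLOT_SIZE)
    if remainder != 0 or slot >= len(SLOTS):
        return None  # inside the page, but not at an entry point
    return SLOTS[slot]


# si_addr from the SIGSEGV lines in the WSL strace
print(describe_vsyscall_addr(0xFFFFFFFFFF600000))
```

An address that decodes to an entry point here suggests the binary is calling into the vsyscall page rather than issuing a regular system call, which would fit the gettimeofday frame in the backtrace.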

@Oeffner Oeffner closed this as completed Dec 6, 2016
@Oeffner Oeffner reopened this Dec 6, 2016
@Oeffner
Author

Oeffner commented Dec 6, 2016

closed by mistake, sorry!

So what is the best course of action for program developers, and is there any chance that real-time support will become available in WSL?

@therealkenc
Collaborator

It probably isn't the compiler per se, but rather how some dependent static library is doing stuff. For now your options are (1) find and work around the failing system call dependency in the source, (2) wait for the WSL guys to implement the feature. Features do get attention if they show up enough here on github (it has) and/or in their telemetry. There's plenty of unimplemented surface to choose from and everyone has their favorite.

And as always you can open a feature request on User Voice. YMMV.

@misenesi

misenesi commented Dec 7, 2016

rt_sigaction is implemented, so that doesn't seem to be the issue. What does the strace look like on Ubuntu? Also, since the program uses OpenMP, could you provide both straces with the '-f' option to follow forks, so we can see the complete picture?

@Oeffner
Author

Oeffner commented Dec 7, 2016

Right. So stdout on a native Ubuntu 14 machine with the program is as below:

LABS\rdo20@lamprey:~/Sources/FortranTests$
LABS\rdo20@lamprey:~/Sources/FortranTests$ ./shelxd

  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  +  SHELXD 2013/2     MACROMOLECULAR DIRECT AND PATTERSON METHODS  +
  +  Copyright(c) George M. Sheldrick 2000-2013  Multi-CPU version  +
  +                             started at 12:07:02 on 07 Dec 2016  +
  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

  8 threads running in parallel on  8 CPUs

 The instructions are in name.ins and the reflection data in name.hkl

 Command line switches:
 -tN use only N threads (default: use all)
 -LN reserve working space for 100000N reflections (default: -L10)

LABS\rdo20@lamprey:~/Sources/FortranTests$

and with the strace it is:

LABS\rdo20@lamprey:~/Sources/FortranTests$
LABS\rdo20@lamprey:~/Sources/FortranTests$ strace -ff ./shelxd
execve("./shelxd", ["./shelxd"], [/* 22 vars */]) = 0
uname({sys="Linux", node="lamprey", ...}) = 0
brk(0)                                  = 0x3746000
brk(0x3747170)                          = 0x3747170
arch_prctl(ARCH_SET_FS, 0x3746860)      = 0
set_tid_address(0x3746b30)              = 8091
set_robust_list(0x3746b40, 24)          = 0
futex(0x7ffcfe9763fc, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ffcfe9763fc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 3746860) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x1982830, [], SA_RESTORER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x1982760, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(0x3768170)                          = 0x3768170
brk(0x3769000)                          = 0x3769000
rt_sigaction(SIGFPE, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGILL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGABRT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], 0}, 8) = 0
futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 3
read(3, "cpu  6100975 110963 582956 14474"..., 8192) = 2479
close(3)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb9929e3000
sched_getaffinity(0, 1048576, {ff, 0, 0, 0}) = 32
sched_setaffinity(0, 32, 0)             = -1 EFAULT (Bad address)
munmap(0x7fb9929e3000, 1052672)         = 0
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGILL, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
sched_getaffinity(0, 32, {ff, 0, 0, 0}) = 32
sched_setaffinity(0, 32, {1, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {2, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {4, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {8, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {10, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {20, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {40, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {80, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {ff, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {ff, 0, 0, 0}) = 0
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb990c5f000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb98edda000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb98cf55000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb98b0d0000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb98924b000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb9873c6000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb985541000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb9836bc000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb981837000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb97f9b2000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb97f6a4000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb97d81f000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb97b99a000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb979b15000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb977c90000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb975e0b000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb973f86000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb972101000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb97027c000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb96e3f7000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb96c572000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb96a6ed000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb96a3df000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb96a0d1000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb969dc3000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb969ab5000
open("/etc/localtime", O_RDONLY)        = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb969ab4000
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\7\0\0\0\7\0\0\0\0"..., 4096) = 3661
lseek(3, -2338, SEEK_CUR)               = 1323
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\10\0\0\0\0"..., 4096) = 2338
close(3)                                = 0
munmap(0x7fb969ab4000, 4096)            = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
readlink("/proc/self/fd/1", "/dev/pts/12", 4095) = 11
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "\n", 1
)                       = 1
write(1, "  ++++++++++++++++++++++++++++++"..., 70  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
) = 70
write(1, "  +  SHELXD 2013/2     MACROMOLE"..., 70  +  SHELXD 2013/2     MACROMOLECULAR DIRECT AND PATTERSON METHODS  +
) = 70
write(1, "  +  Copyright(c) George M. Shel"..., 70  +  Copyright(c) George M. Sheldrick 2000-2013  Multi-CPU version  +
) = 70
write(1, "  +                             "..., 70  +                             started at 12:03:51 on 07 Dec 2016  +
) = 70
write(1, "  ++++++++++++++++++++++++++++++"..., 70  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
) = 70
write(1, "\n", 1
)                       = 1
write(1, "  8 threads running in parallel "..., 43  8 threads running in parallel on  8 CPUs
) = 43
write(1, "\n", 1
)                       = 1
write(1, " The instructions are in name.in"..., 70 The instructions are in name.ins and the reflection data in name.hkl
) = 70
write(1, "\n", 1
)                       = 1
write(1, " Command line switches:\n", 24 Command line switches:
) = 24
write(1, " -tN use only N threads (default"..., 43 -tN use only N threads (default: use all)
) = 43
write(1, " -LN reserve working space for 1"..., 67 -LN reserve working space for 100000N reflections (default: -L10)
) = 67
write(1, "\n", 1
)                       = 1
exit_group(0)                           = ?
+++ exited with 0 +++
LABS\rdo20@lamprey:~/Sources/FortranTests$

If I'm not mistaken it looks like WSL fails after the syscall futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 whereas the native Ubuntu continues after that with open("/proc/stat", O_RDONLY|O_CLOEXEC) = 3
and so forth.
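
The by-eye comparison above can be automated. The following is a rough Python sketch, not a tool used in this thread: it reduces each trace to a sequence of syscall names and signal deliveries, ignoring addresses and return values (which differ between runs), and reports the first position where two traces disagree. The function names `events` and `first_divergence` are invented for this example.

```python
import re

# Optional "[pid N]" prefix (added by strace -f), then either "name(" or "--- SIGxxx".
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")
SIGNAL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?--- (SIG\w+)")


def events(trace_text):
    """Reduce strace output to a list of syscall names and signal markers."""
    out = []
    for line in trace_text.splitlines():
        line = line.strip()
        call = CALL_RE.match(line)
        if call:
            out.append(call.group(1))
            continue
        sig = SIGNAL_RE.match(line)
        if sig:
            out.append("signal:" + sig.group(1))
    return out


def first_divergence(trace_a, trace_b):
    """Return (index, event_a, event_b) at the first mismatch, or None."""
    a, b = events(trace_a), events(trace_b)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None


wsl = """futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0xffffffffff600000} ---
exit_group(174)                         = ?
"""
native = """futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 3
"""
print(first_divergence(wsl, native))
```

Comparing only event names is a deliberate simplification; it will miss cases where the same syscall is made with different arguments or returns a different error.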

@misenesi

misenesi commented Dec 7, 2016

@Oeffner Thanks for the strace from Ubuntu. The '-ff' switch outputs strace from different processes into different files, '-f' would be more useful in our scenario, as that outputs all logs into one file.

It is possible that the futex call is crashing, but the fastest way for us to fix it would be to have a repro. Could you write a sample app and compile it in a similar fashion? The crash happens very early on, so I believe that should repro it.

@Oeffner
Author

Oeffner commented Dec 7, 2016

OK here is the strace of the program running on native Ubuntu with the -f flag:

LABS\rdo20@lamprey:~$
LABS\rdo20@lamprey:~$ strace -f ./shelxd
strace: Can't stat './shelxd': No such file or directory
LABS\rdo20@lamprey:~$ cd Sources/FortranTests/
LABS\rdo20@lamprey:~/Sources/FortranTests$
LABS\rdo20@lamprey:~/Sources/FortranTests$
LABS\rdo20@lamprey:~/Sources/FortranTests$
LABS\rdo20@lamprey:~/Sources/FortranTests$ strace -f ./shelxd
execve("./shelxd", ["./shelxd"], [/* 22 vars */]) = 0
uname({sys="Linux", node="lamprey", ...}) = 0
brk(0)                                  = 0x2b01000
brk(0x2b02170)                          = 0x2b02170
arch_prctl(ARCH_SET_FS, 0x2b01860)      = 0
set_tid_address(0x2b01b30)              = 9546
set_robust_list(0x2b01b40, 24)          = 0
futex(0x7fff0bbf5eec, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fff0bbf5eec, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 2b01860) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x1982830, [], SA_RESTORER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x1982760, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(0x2b23170)                          = 0x2b23170
brk(0x2b24000)                          = 0x2b24000
rt_sigaction(SIGFPE, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGILL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGABRT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], 0}, 8) = 0
futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/proc/stat", O_RDONLY|O_CLOEXEC)  = 3
read(3, "cpu  6102139 110963 583317 14748"..., 8192) = 2480
close(3)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7ff00db000
sched_getaffinity(0, 1048576, {ff, 0, 0, 0}) = 32
sched_setaffinity(0, 32, 0)             = -1 EFAULT (Bad address)
munmap(0x7f7ff00db000, 1052672)         = 0
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGILL, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 0}, ru_stime={0, 0}, ...}) = 0
sched_getaffinity(0, 32, {ff, 0, 0, 0}) = 32
sched_setaffinity(0, 32, {1, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {2, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {4, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {8, 0, 0, 0})  = 0
sched_setaffinity(0, 32, {10, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {20, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {40, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {80, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {ff, 0, 0, 0}) = 0
sched_setaffinity(0, 32, {ff, 0, 0, 0}) = 0
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fee357000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fec4d2000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fea64d000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fe87c8000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fe6943000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fe4abe000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fe2c39000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fe0db4000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fdef2f000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fdd0aa000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fdcd9c000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fdaf17000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fd9092000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fd720d000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fd5388000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fd3503000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fd167e000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fcf7f9000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fcd974000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fcbaef000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc9c6a000
mmap(NULL, 32002048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc7de5000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc7ad7000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc77c9000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc74bb000
mmap(NULL, 3203072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc71ad000
open("/etc/localtime", O_RDONLY)        = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fc71ac000
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\7\0\0\0\7\0\0\0\0"..., 4096) = 3661
lseek(3, -2338, SEEK_CUR)               = 1323
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\10\0\0\0\0"..., 4096) = 2338
close(3)                                = 0
munmap(0x7f7fc71ac000, 4096)            = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
readlink("/proc/self/fd/1", "/dev/pts/12", 4095) = 11
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 12), ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "\n", 1
)                       = 1
write(1, "  ++++++++++++++++++++++++++++++"..., 70  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
) = 70
write(1, "  +  SHELXD 2013/2     MACROMOLE"..., 70  +  SHELXD 2013/2     MACROMOLECULAR DIRECT AND PATTERSON METHODS  +
) = 70
write(1, "  +  Copyright(c) George M. Shel"..., 70  +  Copyright(c) George M. Sheldrick 2000-2013  Multi-CPU version  +
) = 70
write(1, "  +                             "..., 70  +                             started at 21:34:29 on 07 Dec 2016  +
) = 70
write(1, "  ++++++++++++++++++++++++++++++"..., 70  +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
) = 70
write(1, "\n", 1
)                       = 1
write(1, "  8 threads running in parallel "..., 43  8 threads running in parallel on  8 CPUs
) = 43
write(1, "\n", 1
)                       = 1
write(1, " The instructions are in name.in"..., 70 The instructions are in name.ins and the reflection data in name.hkl
) = 70
write(1, "\n", 1
)                       = 1
write(1, " Command line switches:\n", 24 Command line switches:
) = 24
write(1, " -tN use only N threads (default"..., 43 -tN use only N threads (default: use all)
) = 43
write(1, " -LN reserve working space for 1"..., 67 -LN reserve working space for 100000N reflections (default: -L10)
) = 67
write(1, "\n", 1
)                       = 1
exit_group(0)                           = ?
+++ exited with 0 +++
LABS\rdo20@lamprey:~/Sources/FortranTests$

and here it is on my WSL PC:

root@Pauli:~/LinuxTest# strace -f shelxd
execve("/mnt/b/LinuxTest/ccp4-7.0/bin/shelxd", ["shelxd"], [/* 36 vars */]) = 0
uname({sys="Linux", node="Pauli", ...}) = 0
brk(0)                                  = 0x333e000
brk(0x333f110)                          = 0x333f110
arch_prctl(ARCH_SET_FS, 0x333e800)      = 0
set_tid_address(0x333ead0)              = 1694
set_robust_list(0x333eae0, 24)          = 0
futex(0x7fffd21bfffc, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fffd21bfffc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 333e800) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x1982830, [], SA_RESTORER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x1982760, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0
brk(0x3360110)                          = 0x3360110
brk(0x3361000)                          = 0x3361000
rt_sigaction(SIGFPE, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGILL, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGABRT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], SA_RESTORER, 0x7f84a6226cb0}, 8) = 0
rt_sigaction(SIGINT, {0x18726a0, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x19816c0}, {SIG_DFL, [], SA_RESTORER, 0x7f84a6226cb0}, 8) = 0
futex(0x1fcd1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0xffffffffff600000} ---
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0xffffffffff600000} ---
exit_group(174)                         = ?
+++ exited with 174 +++
root@Pauli:~/LinuxTest#

I'll see if I can make a small repro with the Intel compiler within the next few days, before its trial license expires.

@therealkenc
Collaborator

therealkenc commented Dec 8, 2016

That sched_setaffinity(0, 32, 0) = -1 EFAULT (Bad address), which happens in your native Ubuntu traces too, looks suspicious in light of the fact that this is a garden variety segfault error. You might be onto something new here; maybe WSL is raising a signal while native Linux just returns EFAULT, and continues obliviously without the affinity actually being set. Strange your threaded WSL trace seemingly isn't "getting that far" though. But that SI_KERNEL signal is coming from somewhere.....

@Oeffner
Author

Oeffner commented Dec 8, 2016

Thanks for the info. If you have any suggestions on which functions to call in a small repro that exercises sched_setaffinity and might make it crash, let me know.
The original program is written in Fortran and built with a compiler that is not the most recent. Since I don't have access to the source code, it's like groping in the dark.

@therealkenc
Collaborator

therealkenc commented Dec 8, 2016

Normally you'd be screwed without the source, but perhaps not here. That sched_setaffinity() is bog wrong the way I'm reading it. Third parameter cpu_set_t *mask is null, which will cause instant kernel grief when it is dereferenced. Test case would be a one-liner I think. I'll check it out when I get a chance, or if you don't beat me to it. Worried about libc artifacts though. Up-thread you said "All of these programs are compiled as static binaries with the intel compiler. If compiling them using dynamic library loading they run fine". I have no idea what those guys are linking against. Which means you'll possibly go through the trouble of the test case only to find it "runs fine". Or not. Would need to see.

Edit -- tried the one-liner and sched_setaffinity() just returns EFAULT on WSL with glibc just like Real Linux. No SIGSEGV signal. No joy.
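
The one-liner itself is not shown in the thread. As a guess at an equivalent test, here is a small Python sketch that calls sched_setaffinity through ctypes with a null mask pointer, mirroring the sched_setaffinity(0, 32, 0) call in the traces. The function name `null_mask_setaffinity` is made up, and the sketch assumes a Linux system whose C library exports sched_setaffinity.

```python
import ctypes
import errno
import os

# Load the C library already linked into this process; use_errno lets
# ctypes capture errno after each foreign call.
libc = ctypes.CDLL(None, use_errno=True)
libc.sched_setaffinity.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_void_p]
libc.sched_setaffinity.restype = ctypes.c_int


def null_mask_setaffinity(size=32):
    """Call sched_setaffinity(0, size, NULL) and return (return value, errno)."""
    ctypes.set_errno(0)
    rc = libc.sched_setaffinity(0, size, None)  # None is passed as a NULL pointer
    return rc, ctypes.get_errno()


rc, err = null_mask_setaffinity()
print(rc, errno.errorcode.get(err, err), os.strerror(err))
```

Going by the native Ubuntu traces above, the call is expected to fail with EFAULT rather than raise a signal; a SIGSEGV here would kill the interpreter instead of returning.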

@therealkenc
Collaborator

therealkenc commented Jan 11, 2018

Duping into #1462 because it has more details on SIGSEGV si_addr=0xffffffffff600000. Also changing the issue title to help searches; the problem is not futex() related.
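
One way to check whether an environment provides the vsyscall page at all is to look for a `[vsyscall]` entry in /proc/self/maps. The Python sketch below does that; the function names are invented for this example, and whether the WSL build discussed here lists such an entry is an assumption this thread does not confirm either way.

```python
def has_vsyscall_mapping(maps_text):
    """True if any line of /proc/<pid>/maps content is labelled [vsyscall]."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vsyscall]":
            return True
    return False


def current_process_has_vsyscall(path="/proc/self/maps"):
    """Check the running process; returns None if the maps file is unreadable."""
    try:
        with open(path) as f:
            return has_vsyscall_mapping(f.read())
    except OSError:
        return None


print(current_process_has_vsyscall())
```

A result of False on a system where a statically linked binary faults at 0xffffffffff600000 would be consistent with the diagnosis in the new issue title.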

@therealkenc therealkenc changed the title from "Program making futex system call crashing" to "CCP4 segfaults for lack of vsyscall page" Jan 11, 2018