Segmentation faults when compiling in 32 bit on 64 bit Linux platform #186

Closed
danpovey opened this issue Jan 20, 2013 · 6 comments

@danpovey

Guys,
When I run my toolkit's tests using OpenBLAS on a particular platform, I get certain hard-to-replicate segmentation faults in memory.c. These occur inconsistently and not when I run in gdb; it's easier to enable core dumps and wait until one happens.

There is some information below.

svatava:matrix: gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/i686-linux/4.5.4/lto-wrapper
Target: i686-linux
Configured with: ../configure --build=i686-linux --with-arch=nocona --with-tune=core2 --with-thread=posix --with-as=/usr/local/bin/as --with-ld=/usr/local/bin/ld --with-system-zlib --program-suffix=-4.5
Thread model: posix
gcc version 4.5.4 (GCC)

BTW, the test code is not multi-threaded, and I configured OpenBLAS with:

make: Nothing to be done for `all'.
svatava:matrix: make test
Running matrix-lib-test ...... SUCCESS
Running kaldi-gpsr-test ...... SUCCESS
svatava:matrix: make test
Running matrix-lib-test .../bin/sh: line 1: 4758 Segmentation fault (core dumped) ./$x > /dev/null 2>&1
... FAIL
Running kaldi-gpsr-test ...... SUCCESS
make: *** [test] Error 1

svatava:matrix: gdb ./matrix-lib-test core.4758
GNU gdb (GDB) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /mnt/matylda6/jhu09/qpovey/sourceforge/kaldi/trunk/src/matrix/matrix-lib-test...done.
[New Thread 4763]
[New Thread 4758]
[New Thread 4762]
[New Thread 4764]
[New Thread 4761]
[New Thread 4760]
[New Thread 4759]
[New Thread 4765]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /homes/eva/q/qpovey/sourceforge/kaldi/trunk/tools/OpenBLAS/install/lib/libopenblas.so.0...done.
Loaded symbols for /homes/eva/q/qpovey/sourceforge/kaldi/trunk/tools/OpenBLAS/install/lib/libopenblas.so.0
Reading symbols from /usr/local/lib/libgfortran.so.3...done.
Loaded symbols for /usr/local/lib/libgfortran.so.3
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /usr/lib/libstdc++.so.6...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `./matrix-lib-test'.
Program terminated with signal 11, Segmentation fault.
#0 0xf6ea211b in alloc_mmap (address=0x0) at memory.c:433

433 *(long *)start = (long)start + PAGESIZE;
(gdb) up
#1 0xf6ea26be in blas_memory_alloc (procpos=2) at memory.c:987

987 map_address = (*func)((void *)base_address);
(gdb) p base_address
$1 = 0
(gdb) up
#2 0xf6ea34fb in blas_thread_server (arg=0x4) at blas_server.c:274

274 buffer = blas_memory_alloc(2);
(gdb) up
#3 0x4f08c832 in start_thread () from /lib/libpthread.so.0

(gdb) up
#4 0x4efcc4de in clone () from /lib/libc.so.6

(gdb) up
Initial frame selected; you cannot go up.
(gdb) p func
No symbol "func" in current context.
(gdb) down
#3 0x4f08c832 in start_thread () from /lib/libpthread.so.0

(gdb) down
#2 0xf6ea34fb in blas_thread_server (arg=0x4) at blas_server.c:274

274 buffer = blas_memory_alloc(2);
(gdb) down
#1 0xf6ea26be in blas_memory_alloc (procpos=2) at memory.c:987

987 map_address = (*func)((void *)base_address);
(gdb) p func
$2 = (void *(*)(void *)) 0xf350b344
(gdb) p base_address
$3 = 0
(gdb) down
#0 0xf6ea211b in alloc_mmap (address=0x0) at memory.c:433

433 *(long *)start = (long)start + PAGESIZE;
(gdb) p start
$4 = 3956314112
(gdb) p (long *)start
$5 = (long *) 0xebd09000
(gdb) p *((long *)start)
Cannot access memory at address 0xebd09000
(gdb) list
428
429 start = (BLASULONG)map_address;
430 current = (SCALING - 1) * BUFFER_SIZE;
431
432 while(current > 0) {
433 *(long *)start = (long)start + PAGESIZE;
434 start += PAGESIZE;
435 current -= PAGESIZE;
436 }
437
(gdb) p sizeof(long)
$6 = 4
(gdb) p sizeof(void *)
$7 = 4
(gdb) p map_address
$8 = (void *) 0xebd09000
(gdb) p memory
$9 = {{lock = 0, addr = 0xf4d0f000, pos = 0, used = 1, dummy = '\000' <repeats 47 times>}, {lock = 0, addr = 0x0, pos = -1, used = 1,
dummy = '\000' <repeats 47 times>}, {lock = 0, addr = 0x0, pos = -1, used = 1, dummy = '\000' <repeats 47 times>}, {lock = 0, addr = 0x0,
pos = -1, used = 1, dummy = '\000' <repeats 47 times>}, {lock = 0, addr = 0x0, pos = -1, used = 0,
dummy = '\000' <repeats 47 times>} <repeats 28 times>}
(gdb)

@danpovey
Author

I was trying to put in the info RE how I configured OpenBLAS, and it submitted the issue.
Here is that information (part of a Makefile):

openblas_compiled:
	-git clone git://github.com/xianyi/OpenBLAS
	$(MAKE) PREFIX=`pwd`/OpenBLAS/install FC=gfortran $(fortran_opt) DEBUG=1 USE_THREAD=0 -C OpenBLAS all install

@danpovey
Author

BTW, I don't know if the problem has anything to do with the following comment re SEGFAULT, in common_linux.h ?
Dan

static inline int my_mbind(void *addr, unsigned long len, int mode,
                           unsigned long *nodemask, unsigned long maxnode,
                           unsigned flags) {
#if defined (LOONGSON3B)
#if defined (64BIT)
  return syscall(SYS_mbind, addr, len, mode, nodemask, maxnode, flags);
#else
  return 0; //NULL Implementation on Loongson 3B 32bit.
#endif
#else
//Fixed randomly SEGFAULT when nodemask==NULL with above Linux 2.6.34
// unsigned long null_nodemask=0;
  return syscall(SYS_mbind, addr, len, mode, nodemask, maxnode, flags);
#endif
}

@danpovey
Author

Guys, The patch below fixes the problem for me. It looks like the "nodemask" argument to the
"mbind" call on Linux is not allowed to be NULL, and previously this was fixed, but at some point,
the fix broke some other code and someone reverted the change. This patch fixes it in a more
robust way, so that if the "nodemask" argument is non-NULL, it uses that, else the address of
a zero-valued long int.

From 09c398574079dd26379c62cddf02afc5bdcf327f Mon Sep 17 00:00:00 2001
From: Povey Daniel <qpovey@svatava.fit.vutbr.cz>
Date: Mon, 21 Jan 2013 01:08:12 +0100
Subject: [PATCH] Fixing common_linux.h RE segfault on Linux.

---
 common_linux.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/common_linux.h b/common_linux.h
index 6766ff3..f3530e7 100644
--- a/common_linux.h
+++ b/common_linux.h
@@ -75,9 +75,10 @@ static inline int my_mbind(void *addr, unsigned long len, int mode,
   return 0; //NULL Implementation on Loongson 3B 32bit.
 #endif
 #else
-//Fixed randomly SEGFAULT when nodemask==NULL with above Linux 2.6.34
-// unsigned long null_nodemask=0;
-  return syscall(SYS_mbind, addr, len, mode, nodemask, maxnode, flags);
+  // Fixed random SEGFAULT when nodemask==NULL with above Linux 2.6.34
+  unsigned long null_nodemask = 0;
+  return syscall(SYS_mbind, addr, len, mode,
+               (nodemask ? nodemask : &null_nodemask), maxnode, flags);
 #endif
 }

1.7.9.6

@xianyi
Collaborator

xianyi commented Jan 21, 2013

Hi @danpovey ,

This is a known issue in OpenBLAS. On some Linux kernel versions, mbind cannot accept NULL, which is a kernel bug. You can apply segfaults.patch to work around this issue. For example,

patch -ruN < segfaults.patch

Xianyi

@danpovey
Author

Wouldn't it be simpler just to always compile with my patch? This code is
very infrequently called so it is not an efficiency issue. I don't see how
a user is expected to know how to do what you propose. Also, "recent
Linux" is probably the most common platform your code will be compiled on,
so it can hardly be considered a special case.
Dan

@xianyi
Collaborator

xianyi commented Jan 22, 2013

Hi @danpovey,

OpenBLAS always passes NULL to mbind, which sets the memory policy to allocate memory on the local node. This can improve performance.

I think recent Linux kernels have fixed this bug.

Xianyi

@xianyi xianyi closed this as completed Jun 30, 2013