Error using OpenBLAS in an OpenMP Application #85

Closed
grisuthedragon opened this issue Apr 2, 2012 · 18 comments

@grisuthedragon
Contributor

I get an error when using OpenBLAS (master and development branches) in an OpenMP application.
The program crashes with a segmentation fault, and gdb gives me the following:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1a4c700 (LWP 12429)]
?? () at ../kernel/x86_64/copy_sse2.S:592 from /scratch/koehlerm/mess/OpenBLAS/libopenblas.so.0
592 movhps (X), %xmm0
(gdb) bt
#0 ?? () at ../kernel/x86_64/copy_sse2.S:592 from /scratch/koehlerm/mess/OpenBLAS/libopenblas.so.0
#1 0x00007ffff5737bee in ger_kernel (args=0x7fffffffbb00, range_m=0x100000000, range_n=0x7fffffffbb98, dummy1=0x7fffe6d4f080, buffer=0x7fffe6e4f080, pos=3) at ger_thread.c:88
#2 0x00007ffff5b229d6 in exec_threads (queue=0x7fffffffba58) at blas_server_omp.c:240
#3 0x00007ffff5b22b25 in exec_blas.omp_fn.0 (.omp_data_i=0x7fffffffb7f0) at blas_server_omp.c:268
#4 0x00007ffff49ce7ca in gomp_thread_start (xdata=Unhandled dwarf expression opcode 0xf3
) at ../../../libgomp/team.c:116
#5 0x00007ffff500a9ca in start_thread (arg=) at pthread_create.c:300
#6 0x00007ffff472970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7 0x0000000000000000 in ?? ()

(gdb) list
587 ALIGN_3
588
589 .L41:
590 movsd (X), %xmm0
591 addq INCX, X
592 movhps (X), %xmm0
593 addq INCX, X
594 movsd (X), %xmm1
595 addq INCX, X
596 movhps (X), %xmm1
(gdb)

OpenBLAS is compiled with:

  • BINARY=64
  • INTERFACE = 64
  • USE_OPENMP=1
  • DEBUG=1
@xianyi
Collaborator

xianyi commented Apr 2, 2012

Hi,

Thank you for this report.
What are your CPU, OS, and compiler version? Did you use 64-bit integers in
your application?

Could you give me some test code that reproduces this error? You can send
it to my email, traits.zhang at gmail.com

Thank you again.

Xianyi

@grisuthedragon
Contributor Author

I use Ubuntu 10.04 LTS

$ gcc --version
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3

$ cat /proc/cpuinfo | grep "model name" | head -n1
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz

I use 64-bit integers everywhere. Unfortunately, I don't have a working minimal example. The code has been running with OpenBLAS for more than half a year, and only now has it crashed.

@xianyi
Collaborator

xianyi commented Apr 3, 2012

Hi,

From the following,

#0 ?? () at ../kernel/x86_64/copy_sse2.S:592 from /scratch/koehlerm/mess/OpenBLAS/libopenblas.so.0

#1 0x00007ffff5737bee in ger_kernel (args=0x7fffffffbb00,
range_m=0x100000000, range_n=0x7fffffffbb98, dummy1=0x7fffe6d4f080,
buffer=0x7fffe6e4f080, pos=3) at ger_thread.c:88

I think it crashed in the dger function. Could you give more information
about "args=0x7fffffffbb00, range_m=0x100000000, range_n=0x7fffffffbb98"?

args is a pointer to a structure; range_m and range_n are integer arrays.

Thank you

Xianyi

@ghost ghost assigned xianyi Apr 5, 2012
@aeberspaecher
Contributor

I have crashes, too.

Unfortunately, I cannot provide any helpful gdb output, as the code that crashes is a Python module using OpenBLAS and I don't have all the debugging symbols at hand.

Maybe helpful: I did not have crashes with 0.1 alpha 2.5, the crashes only started after the upgrade to 0.1.

CPU: model name : Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz

@xianyi
Collaborator

xianyi commented Apr 6, 2012

What's your CPU and OS? 64-bit or 32?

Xianyi

@aeberspaecher
Contributor

CPU: model name : Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
OS: Scientific Linux release 6.2 (Carbon) (64bit)

I am happy to provide any further information you need.

Remark: the tests after building pass. The crashes do not occur on every run of my code.

@xianyi
Collaborator

xianyi commented Apr 6, 2012

Hi grisuthedragon,

I just tested OpenBLAS with INTERFACE = 64 and USE_OPENMP=1. I cannot reproduce dger or copy errors.
Could you give more information about dger input args "args=0x7fffffffbb00, range_m=0x100000000, range_n=0x7fffffffbb98"?

Thanks
Xianyi

@xianyi
Collaborator

xianyi commented Apr 6, 2012

Hi Alexander,

Could you build OpenBLAS with DEBUG=1? Then enable core dumps as follows:
ulimit -c unlimited

Next, run the program until it crashes, then load the core file in gdb:

gdb your_program core

This shows which function crashed. You can also type "bt" to print the backtrace.

Xianyi
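
The core-dump steps in the comment above can also be driven from a small launcher script. This is only a minimal sketch, assuming a POSIX system and Python 3. `run_with_core_dumps` is a hypothetical helper name, not part of OpenBLAS, and raising the soft limit to the hard limit stands in for `ulimit -c unlimited`, so the core size is unlimited only if the hard limit allows it.

```python
import resource
import subprocess


def _raise_core_limit():
    """Runs in the child just before exec: raise the soft core-size limit
    to the hard limit, the closest an unprivileged process can get to
    `ulimit -c unlimited`."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(cmd):
    """Run `cmd` (a list of arguments) with core dumps enabled.

    Returns the exit status. On POSIX a negative value -N means the
    process was killed by signal N.
    """
    completed = subprocess.run(cmd, preexec_fn=_raise_core_limit)
    return completed.returncode


# Hypothetical usage (the program name is a placeholder):
#     status = run_with_core_dumps(["./your_program"])
#     if status < 0:
#         print("killed by signal", -status, "- inspect with: gdb your_program core")
```

Where the core file is written, and under which name, depends on the system configuration, so the `gdb your_program core` step may need a different file name.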

@aeberspaecher
Contributor

Xianyi, unfortunately I cannot find out which function leads to the crash. I do not have all the required debugging symbols at hand (I run a Python script that uses a Python module created from Fortran using f2py, which itself calls OpenBLAS). My code has an awful lot of dependencies for which I cannot get debugging symbols.

All I have in the backtrace is

#0 0x00007f451200269c in ?? ()
#1 0x00007f4512d4ce20 in ?? ()
#2 0x0000000000000000 in ?? ()

Please let me know if I can help in any other way.
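
When the native backtrace is this bare, Python's standard `faulthandler` module can at least report which Python-level call was active when the signal arrived. The sketch below assumes Python 3 on a POSIX system. The child script and the name `call_into_native_code` are hypothetical stand-ins for the real f2py module, and the segfault is simulated by sending SIGSEGV to the process itself.

```python
import subprocess
import sys

# Source of a child script that stands in for the crashing application.
CHILD_SOURCE = """
import faulthandler
import os
import signal

faulthandler.enable()  # dump the Python stack to stderr on SIGSEGV, SIGBUS, ...

def call_into_native_code():
    # Placeholder for the call into the f2py/OpenBLAS module.
    os.kill(os.getpid(), signal.SIGSEGV)

call_into_native_code()
"""


def run_child():
    """Run the child script and return the CompletedProcess with captured output."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
    )
```

Calling `run_child().stderr` should contain a Python traceback that names `call_into_native_code`. For an existing script, the same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler`.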

@xianyi
Collaborator

xianyi commented Apr 13, 2012

Hi Alexander,

Do you know whether your application uses the shared OpenBLAS library or the static one?
If it uses the shared library, you can replace it with your own debug build.

Xianyi

@xianyi
Collaborator

xianyi commented Apr 13, 2012

Hi Alexander,

Please test this:

export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
./your_application

Xianyi
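
For an application that is started from Python, the single-threaded test requested above can be sketched as follows. This assumes Python 3. The helper names `single_threaded_env` and `run_single_threaded` are hypothetical, and only the two environment variables named in the comment are set.

```python
import os
import subprocess


def single_threaded_env(base=None):
    """Return a copy of the environment with both thread counts pinned to 1."""
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_NUM_THREADS"] = "1"
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(cmd):
    """Run `cmd` (a list of arguments) in the single-threaded environment
    and return its exit status."""
    return subprocess.run(cmd, env=single_threaded_env()).returncode


# Hypothetical usage (the program name is a placeholder):
#     run_single_threaded(["./your_application"])
```

Launching a child process keeps the settings out of the parent's environment. Setting the variables from inside an already running script is less reliable, because such settings are typically read when the library is first loaded.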

@aeberspaecher
Contributor

Dear Xianyi,

I use the shared object version of OpenBLAS. I had already replaced the .so file with the debug version. However, gdb cannot tell me which function crashed.

However, the problem seems to vanish if I run my script with OMP_NUM_THREADS=1. At least that's what I infer here: I tested several runs and saw no crashes. With higher OMP_NUM_THREADS, I have crashes in about 30 to 50% of runs.
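
Because the crash is intermittent, a rough crash rate per thread count is easier to compare than individual runs. The following is a minimal sketch, assuming Python 3 on a POSIX system, where a negative exit status means the process was killed by a signal. `crash_rate` is a hypothetical helper, and the command to run is a placeholder.

```python
import os
import subprocess


def crash_rate(cmd, runs, num_threads):
    """Run `cmd` `runs` times with OMP_NUM_THREADS=num_threads and return
    the fraction of runs that were killed by a signal (e.g. SIGSEGV)."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode < 0:  # POSIX: -N means killed by signal N
            crashes += 1
    return crashes / runs


# Hypothetical usage (script name is a placeholder):
#     for n in (1, 2, 4):
#         print(n, crash_rate(["python", "your_script.py"], runs=20, num_threads=n))
```

With only a handful of runs the estimate is noisy, so a clean result at one thread count is suggestive rather than conclusive.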

@aeberspaecher
Contributor

The segfaults are still there on the develop branch.

When segfaults occur, they appear immediately after my code is started, even before the first BLAS routine is called. Maybe that helps to narrow it down. Could there be something wrong in the build process?

@xianyi
Collaborator

xianyi commented Apr 27, 2012

Thank you for this information.

I think it may be related to memory allocation or thread creation.

Xianyi

@xianyi
Collaborator

xianyi commented Apr 28, 2012

Hi Alexander,

Please help me with the following two experiments:

  1. Build OpenBLAS with pthreads instead of OpenMP. Then test your application.

  2. Try applying this patch (https://gist.github.com/2517475).
    Build OpenBLAS with OpenMP and test it.

What is your Linux kernel version?

Thanks

Xianyi

@aeberspaecher
Contributor

Hi Xianyi,

I applied your patch. The crashes are gone! I am using kernel 2.6.32-220.7.1 64-bit on Scientific Linux 6.2.

If you are still interested in the outcome of the first experiment, please let me know. I am happy to test that as well.

Thanks for the patch,

Alex

@xianyi
Collaborator

xianyi commented May 2, 2012

Hi Alex,

I ran into this crash a year ago. It may be related to a kernel bug. At
the time, I applied this patch to OpenBLAS. However, when I tested OpenBLAS
on some newer kernel versions, the patch caused errors or warnings, so I
rolled the code back.

I suggest you upgrade the kernel. I think the crash will be gone without
the patch.

Thank you

Xianyi

@aeberspaecher
Contributor

I had lots of compiler warnings, too.

Unfortunately, I cannot upgrade my kernel. I think there are many other users stuck on Red Hat Enterprise Linux 6, CentOS 6 and Scientific Linux 6. All these users use the same kernel as I do. Given the importance of those distributions, it might be wise to document this patch. I could issue a pull request if needed. Just let me know if you think this is helpful.
