
error : Illegal instruction (core dumped) #24

Closed
hadyelsahar opened this issue Jun 1, 2015 · 4 comments
@hadyelsahar

I am trying to run the basic example in the tutorial using the CPU:
th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1

I get the following error:

loading data files...   
cutting off end of data so that the batches/sequences divide evenly 
reshaping tensor... 
data load done. Number of batches in train: 211, val: 11, test: 1   
vocab size: 65  
creating an LSTM with 2 layers  
number of parameters in the model: 154165   
cloning criterion   
cloning softmax 
cloning embed   
cloning rnn 
Illegal instruction (core dumped)
@krasin

krasin commented Jun 3, 2015

@hadyelsahar usually, SIGILL happens when a binary contains an instruction that is not supported by the CPU. The common scenario is compiling a binary on one (newer) computer, then copying the binary to another (older) computer and running it there.

In your case, I would blindly guess that your computer does not support the AVX2 instruction set, while the computer used for compilation did support it.

If you want to find out which module and which instruction causes this crash, I would recommend running it under gdb:

gdb --args th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1
run

Once it happens, please post the stack trace and the disassembly here.

gdb commands:
stack trace: bt
disassembly of the current block: disas

The currently executing instruction will be marked with "=> ".

@hadyelsahar

Thanks for your help. It seems the problem is with the vmovsd instruction.

The stack trace:

#0  0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0
#1  0x0000000000000041 in ?? ()
#2  0x0000000000000026 in ?? ()
#3  0x00007ffff51cd0c7 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#4  0x00007ffff52da20c in blas_thread_server () from /opt/OpenBLAS/lib/libopenblas.so.0
#5  0x00007ffff7474182 in start_thread (arg=0x7ffff1b05700) at pthread_create.c:312
#6  0x00007ffff6f8b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The disassembly of the current block:

Dump of assembler code for function dgemm_oncopy:
   0x00007ffff532de00 <+0>: push   %r13
   0x00007ffff532de02 <+2>: push   %r12
   0x00007ffff532de04 <+4>: lea    0x0(,%rcx,8),%rcx
   0x00007ffff532de0c <+12>:    mov    %rsi,%r10
   0x00007ffff532de0f <+15>:    sar    %r10
   0x00007ffff532de12 <+18>:    jle    0x7ffff532dfd0 <dgemm_oncopy+464>
   0x00007ffff532de18 <+24>:    nopl   0x0(%rax,%rax,1)
   0x00007ffff532de20 <+32>:    mov    %rdx,%r11
   0x00007ffff532de23 <+35>:    lea    (%rdx,%rcx,1),%r12
   0x00007ffff532de27 <+39>:    lea    (%rdx,%rcx,2),%rdx
   0x00007ffff532de2b <+43>:    mov    %rdi,%r9
   0x00007ffff532de2e <+46>:    sar    $0x3,%r9
   0x00007ffff532de32 <+50>:    jle    0x7ffff532df10 <dgemm_oncopy+272>
   0x00007ffff532de38 <+56>:    nopl   0x0(%rax,%rax,1)
   0x00007ffff532de40 <+64>:    prefetchw 0x100(%r8)
   0x00007ffff532de48 <+72>:    prefetchnta 0x100(%r11)
=> 0x00007ffff532de50 <+80>:    vmovsd (%r11),%xmm0
   0x00007ffff532de55 <+85>:    vmovsd 0x8(%r11),%xmm1
   0x00007ffff532de5b <+91>:    vmovsd 0x10(%r11),%xmm2
   0x00007ffff532de61 <+97>:    vmovsd 0x18(%r11),%xmm3
   0x00007ffff532de67 <+103>:   vmovsd 0x20(%r11),%xmm4
   0x00007ffff532de6d <+109>:   vmovsd 0x28(%r11),%xmm5
   0x00007ffff532de73 <+115>:   vmovsd 0x30(%r11),%xmm6
   0x00007ffff532de79 <+121>:   vmovsd 0x38(%r11),%xmm7
   0x00007ffff532de7f <+127>:   prefetchnta 0x100(%r12)
   0x00007ffff532de88 <+136>:   vmovhpd (%r12),%xmm0,%xmm0
   0x00007ffff532de8e <+142>:   vmovhpd 0x8(%r12),%xmm1,%xmm1
   0x00007ffff532de95 <+149>:   vmovhpd 0x10(%r12),%xmm2,%xmm2
   0x00007ffff532de9c <+156>:   vmovhpd 0x18(%r12),%xmm3,%xmm3
   0x00007ffff532dea3 <+163>:   vmovhpd 0x20(%r12),%xmm4,%xmm4
   0x00007ffff532deaa <+170>:   vmovhpd 0x28(%r12),%xmm5,%xmm5
   0x00007ffff532deb1 <+177>:   vmovhpd 0x30(%r12),%xmm6,%xmm6
   0x00007ffff532deb8 <+184>:   vmovhpd 0x38(%r12),%xmm7,%xmm7
   0x00007ffff532debf <+191>:   prefetchw 0x140(%r8)
   0x00007ffff532dec7 <+199>:   vmovups %xmm0,(%r8)
   0x00007ffff532decc <+204>:   vmovups %xmm1,0x10(%r8)
   0x00007ffff532ded2 <+210>:   vmovups %xmm2,0x20(%r8)
   0x00007ffff532ded8 <+216>:   vmovups %xmm3,0x30(%r8)
   0x00007ffff532dede <+222>:   vmovups %xmm4,0x40(%r8)

Just for reference, in case someone faces the same problem: the Torch executable ~/torch/bin/th is a script, not a binary, so gdb can't actually debug it directly.

file /torch/install/bin/th
th: POSIX shell script, ASCII text executable, with very long lines

To work around it, you'll need to execute the following:

gdb64 /bin/bash    # or plain gdb; check whether your gdb build is i686 or x86_64

Then, from the gdb prompt, run:

run th train.lua -data_dir data/tinyshakespeare/ -rnn_size 100 -num_layers 2 -dropout 0.5 -gpuid -1 

P.S. I think this issue is related more to Torch than to this repo, so feel free to ask me to move it there.

@krasin

krasin commented Jun 5, 2015

Good data, @hadyelsahar!

According to #0 0x00007ffff532de50 in dgemm_oncopy () from /opt/OpenBLAS/lib/libopenblas.so.0, it's not even Torch that is to blame, but the installation of OpenBLAS. I would recommend reinstalling it and/or investigating how it was previously installed. It seems that a plain cp was involved.
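A sketch of one way to reinstall (not the exact commands from this thread; the PREFIX below is an assumption taken from the path in the stack trace): building OpenBLAS from source on the target machine lets it detect the CPU at build time, and OpenBLAS's DYNAMIC_ARCH option compiles kernels for many CPU families and selects the right one at runtime, which avoids SIGILL when the build and run machines differ:

```shell
# Build OpenBLAS on the machine that will actually run it.
# DYNAMIC_ARCH=1 compiles kernels for many CPU families and picks
# a supported one at runtime instead of assuming AVX2 is available.
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make DYNAMIC_ARCH=1
sudo make DYNAMIC_ARCH=1 PREFIX=/opt/OpenBLAS install
```

Alternatively, forcing a specific (older) kernel set with, e.g., `make TARGET=NEHALEM` also avoids AVX2 instructions, at some cost in performance on newer CPUs.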

@hadyelsahar

That makes sense now: OpenBLAS failed to detect my processor configuration automatically, so I edited the Torch dependency download script according to what I was told in this issue.

Although I've built and installed OpenBLAS on my machine manually, that probably hasn't fixed it. Anyway, let's see there.

Many Thanks
Regards
