CP2K: illegal instruction error #5795

Closed
0luhancheng0 opened this issue Jan 20, 2021 · 4 comments

@0luhancheng0

Recipe

The recipe is generated using this Python script.

Error

Singularity exec

This error happens when I run:

singularity exec --nv /usr/local/cp2k/8.1.0/cp2k.sif cp2k.psmp --help

It seems to be hardware dependent: the container works fine on one of the clusters and breaks on the other.

$ singularity exec --nv /usr/local/cp2k/8.1.0/cp2k.sif cp2k.psmp --help
SIGILL: illegal instruction
PC=0x47282b m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x3cd4, 0x4, 0x0, 0xc00020fef0, 0x48f422, 0x3cd4)
    /usr/local/go/1.11.1/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc00020feb8 sp=0xc00020feb0 pc=0x47282b
syscall.Kill(0x3cd4, 0x4, 0x4377de, 0xc00020ff20)
    /usr/local/go/1.11.1/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc00020ff00 sp=0xc00020feb8 pc=0x46f1db
github.com/sylabs/singularity/internal/app/starter.Master.func4()
    internal/app/starter/master_linux.go:158 +0x3e fp=0xc00020ff38 sp=0xc00020ff00 pc=0x8d51be
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
    internal/pkg/util/mainthread/mainthread.go:20 +0x2f fp=0xc00020ff60 sp=0xc00020ff38 pc=0x87472f
main.main()
    cmd/starter/main_linux.go:102 +0x68 fp=0xc00020ff98 sp=0xc00020ff60 pc=0x8d59f8
runtime.main()
    /usr/local/go/1.11.1/src/runtime/proc.go:201 +0x207 fp=0xc00020ffe0 sp=0xc00020ff98 pc=0x42faa7
runtime.goexit()
    /usr/local/go/1.11.1/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00020ffe8 sp=0xc00020ffe0 pc=0x45b4f1

goroutine 19 [syscall]:
os/signal.signal_recv(0xaa2620)
    /usr/local/go/1.11.1/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
    /usr/local/go/1.11.1/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
    /usr/local/go/1.11.1/src/os/signal/signal_unix.go:29 +0x41

goroutine 4 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc00031b7b0)
    internal/pkg/util/mainthread/mainthread.go:23 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x4, 0x7, 0x1600, 0x3ce2, 0xc00000c0e0)
    internal/app/starter/master_linux.go:157 +0x44e
main.startup()
    cmd/starter/main_linux.go:73 +0x563
created by main.main
    cmd/starter/main_linux.go:98 +0x3e

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x3cd4
rsi    0x4
rbp    0xc00020fef0
rsp    0xc00020feb0
r8     0x0
r9     0x0
r10    0x0
r11    0x206
r12    0xc
r13    0xff
r14    0xa972b8
r15    0x0
rip    0x47282b
rflags 0x206
cs     0x33
fs     0x0
gs     0x0

Singularity shell

I have also tried singularity shell. If I first run singularity shell and then cp2k.psmp, the program fails with Illegal instruction (core dumped).

The same SIGILL traceback then appears when I exit the container (Ctrl-D inside singularity shell).

[luhanc@gp04 files_gpu]$ singularity shell --nv /usr/local/cp2k/8.1.0/cp2k.sif 
bash: warning: setlocale: LC_ALL: cannot change locale (en_AU.UTF-8)
Singularity cp2k.sif:~/files_gpu> psmp^C
Singularity cp2k.sif:~/files_gpu> cp2k.psmp --help
Illegal instruction (core dumped)
Singularity cp2k.sif:~/files_gpu
Singularity cp2k.sif:~/files_gpu> exit
exit
SIGILL: illegal instruction
PC=0x47282b m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x4080, 0x4, 0x0, 0xc00020fef0, 0x48f422, 0x4080)
        /usr/local/go/1.11.1/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc00020feb8 sp=0xc00020feb0 pc=0x47282b
syscall.Kill(0x4080, 0x4, 0x4377de, 0xc00020ff20)
        /usr/local/go/1.11.1/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc00020ff00 sp=0xc00020feb8 pc=0x46f1db
github.com/sylabs/singularity/internal/app/starter.Master.func4()
        internal/app/starter/master_linux.go:158 +0x3e fp=0xc00020ff38 sp=0xc00020ff00 pc=0x8d51be
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
        internal/pkg/util/mainthread/mainthread.go:20 +0x2f fp=0xc00020ff60 sp=0xc00020ff38 pc=0x87472f
main.main()
        cmd/starter/main_linux.go:102 +0x68 fp=0xc00020ff98 sp=0xc00020ff60 pc=0x8d59f8
runtime.main()
        /usr/local/go/1.11.1/src/runtime/proc.go:201 +0x207 fp=0xc00020ffe0 sp=0xc00020ff98 pc=0x42faa7
runtime.goexit()
        /usr/local/go/1.11.1/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00020ffe8 sp=0xc00020ffe0 pc=0x45b4f1

goroutine 19 [syscall]:
os/signal.signal_recv(0xaa2620)
        /usr/local/go/1.11.1/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
        /usr/local/go/1.11.1/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
........

Version of Singularity: 3.2.1

What version of Singularity are you using? Run:

$ singularity version
3.2.1

What OS/distro are you running

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

How did you install Singularity

Built from source.

@dtrudg
Contributor

dtrudg commented Jan 20, 2021

This is not likely to be an issue with Singularity itself. However, note that you are using an outdated version of Singularity that is not supported, and was compiled with an unmaintained version of Go. Please update to the current version.

The container (and possibly Singularity itself, I'm not certain) has been built so that the software is optimized for a specific class of CPU. Newer CPUs add new instructions, and these are commonly exploited by numerical libraries to improve speed. When the software is run on a CPU that does not support the same instructions, these errors can occur.

E.g. if you build on/for a CPU supporting AVX512, and then attempt to run on an older CPU that does not support AVX512 you will get an illegal instruction error when executing the program.

When building a container you must ensure that the build configuration supports the oldest / least featured CPU that the container will be run on.

What is the CPU on the system that was used to build the container?
What is the CPU on the system that the crash occurs on?
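
A quick way to answer this kind of question on a given node is to look at the `flags` line of `/proc/cpuinfo`. The following is a minimal Python sketch, not part of the original report: the helper names `cpu_flags` and `missing_flags` are hypothetical, and the sample text is a shortened stand-in for a real `/proc/cpuinfo`.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text.

    Only the first "flags" line is used; on typical x86 Linux systems every
    logical CPU reports the same list.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(required, cpuinfo_text):
    """Return the required flags (sorted) that the CPU does not advertise."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Shortened, made-up sample in the layout used by /proc/cpuinfo.
SAMPLE = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2 fma\n"

# avx512f is the AVX-512 "foundation" flag; avx2 is present in the sample.
print(missing_flags(["avx2", "avx512f"], SAMPLE))
```

On a real node the text would come from `open("/proc/cpuinfo").read()`; an empty result means every flag you asked about is advertised by that CPU.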

@0luhancheng0
Author

Thanks for the reply @dtrudg,

The node that builds the container

[luhanc@monarch-dtn1 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                18
On-line CPU(s) list:   0-17
Thread(s) per core:    1
Core(s) per socket:    18
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               2693.658
BogoMIPS:              5387.31
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-17
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke md_clear spec_ctrl intel_stibp

The node that crashes the container

[luhanc@gf00 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2499.988
BogoMIPS:              4999.97
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear spec_ctrl intel_stibp
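
To make the difference between the two nodes easier to see, the two `Flags:` lists above can be compared as sets. This is only a sketch for reading the `lscpu` output; the variable and function names are illustrative, and the flag strings are copied from the two listings in this comment.

```python
# Flags reported by lscpu on the build node (Xeon Gold 6150).
BUILD_NODE_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc
arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs
ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle
avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap
clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku
ospke md_clear spec_ctrl intel_stibp
""".split()

# Flags reported by lscpu on the node where the container crashes (E5-2680 v3).
RUN_NODE_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc
arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms invpcid xsaveopt arat md_clear spec_ctrl intel_stibp
""".split()


def only_on_build_node(build_flags, run_flags):
    """Return, sorted, the flags the build node has and the run node lacks."""
    return sorted(set(build_flags) - set(run_flags))


missing = only_on_build_node(BUILD_NODE_FLAGS, RUN_NODE_FLAGS)
print("only on build node:", " ".join(missing))
print("AVX-512 related:", [f for f in missing if f.startswith("avx512")])
```

Not every flag in the difference matters for compiled numerical code, but any `avx512*` entry that shows up only on the build node is a likely candidate for an illegal instruction on the other node.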

@dtrudg
Contributor

dtrudg commented Jan 20, 2021

The Xeon Gold is a newer CPU than the Xeon v3, and it supports more instruction set extensions.

Likely some numerical code is built to use AVX512 instructions on the Xeon Gold. The Xeon v3 does not support this, only AVX2 - so any use of AVX512 instructions there will crash with an illegal instruction error.

You will need to investigate how to build the software in the container so that it does not use instructions that are not supported by the older CPU. This is not a container-specific or Singularity issue.
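
As a rough illustration of "build for the least capable CPU", the sketch below intersects the flags of every node the container must run on and picks a GCC `-march` value whose requirements fit that common baseline. The table is a deliberately small, non-exhaustive assumption (only a few representative flags per target), the node flag sets are abridged, and how the chosen value is passed to the CP2K build depends on the recipe, which is not shown in this thread.

```python
# Illustrative, non-exhaustive table (an assumption for this sketch):
# GCC -march value -> a few /proc/cpuinfo flags that code built for it may use.
# Ordered from most to least demanding.
MARCH_REQUIREMENTS = [
    ("skylake-avx512", {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}),
    ("haswell", {"avx2", "fma", "bmi1", "bmi2"}),
    ("x86-64", {"sse2"}),
]

# Abridged flag sets for the two nodes discussed in this thread.
NODES = {
    "build node (Xeon Gold 6150)": {
        "sse2", "avx", "avx2", "fma", "bmi1", "bmi2",
        "avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl",
    },
    "run node (E5-2680 v3)": {"sse2", "avx", "avx2", "fma", "bmi1", "bmi2"},
}


def common_flags(nodes):
    """Return the flags supported by every node in the mapping."""
    return set.intersection(*(set(flags) for flags in nodes.values()))


def pick_march(nodes):
    """Return the most demanding -march value that all nodes can satisfy."""
    baseline = common_flags(nodes)
    for march, required in MARCH_REQUIREMENTS:
        if required <= baseline:
            return march
    raise ValueError("no -march value in the table fits every node")


print(pick_march(NODES))
```

With the two nodes from this thread the sketch settles on `haswell`, which matches the advice above: the build should not go beyond what the older Xeon v3 supports.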

@dtrudg dtrudg closed this as completed Jan 20, 2021
@0luhancheng0
Author

Thanks @dtrudg, I will try to rebuild the container on the node with the older CPU.
