Mainnet crashed with the message SIGSEGV: Illegal storage access. #2134

Closed
hylin911 opened this issue Dec 2, 2020 · 22 comments

Comments

@hylin911

hylin911 commented Dec 2, 2020

Describe the bug
My validator crashed with the log below.

To Reproduce
Steps to reproduce the behavior:

  1. Platform details (OS, architecture):
    Ubuntu, Intel NUC; I run my own Geth on the same machine.

  2. Branch/commit used:
    N/A; I ran git pull just before the mainnet launch

  3. Commands being executed:
    ./run-mainnet-beacon-node.sh

  4. Relevant log lines:
    INF 2020-12-02 11:00:50.662+08:00 Slot end topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:593 slot=4502 nextSlot=4503 head=2de96dac:4502 headEpoch=140 finalizedHead=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:00:59.047+08:00 Slot start topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:505 lastSlot=4502 scheduledSlot=4503 beaconTime=15h36s47ms933us254ns peers=160 head=2de96dac:4502 headEpoch=140 finalized=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:01:01.446+08:00 Slot end topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:593 slot=4503 nextSlot=4504 head=999cda45:4503 headEpoch=140 finalizedHead=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:01:11.040+08:00 Slot start topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:505 lastSlot=4503 scheduledSlot=4504 beaconTime=15h48s40ms196us286ns peers=160 head=999cda45:4503 headEpoch=140 finalized=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:01:12.679+08:00 Slot end topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:593 slot=4504 nextSlot=4505 head=6dce0a30:4504 headEpoch=140 finalizedHead=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:01:23.044+08:00 Slot start topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:505 lastSlot=4504 scheduledSlot=4505 beaconTime=15h1m44ms529us58ns peers=160 head=6dce0a30:4504 headEpoch=140 finalized=a30f7f9f:4416 finalizedEpoch=138
    INF 2020-12-02 11:01:23.542+08:00 Slot end topics="beacnde" tid=19715 file=nimbus_beacon_node.nim:593 slot=4505 nextSlot=4506 head=88ad56cb:4505 headEpoch=140 finalizedHead=a30f7f9f:4416 finalizedEpoch=138
    peers: 160 ❯ finalized: a30f7f9f:138 ❯ head: 88ad56cb:140:25 ❯ time: 140:25 (4505) ❯ sync: synced ETH: 96.31415611 Traceback (most recent call last, using override)
    /home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
    SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
I had seen the same crash before the mainnet launch, so I ran git pull and built the latest version. This is the second time I have seen it: once before the mainnet launch and once after.

@andytudhope
Contributor

Can confirm my validator node crashed with the same issue

@JergenJerf

I also had the same issue (2 mainnet crashes so far)

@ChihChengLiang

Can confirm too

  1. Platform details (OS, architecture):
    Ubuntu x86_64

  2. Branch/commit used:

HEAD detached at v1.0.3

  3. Commands being executed:

(graffiti masked)

WEB3_URL=ws://127.0.0.1:8546 ./run-mainnet-beacon-node.sh --graffiti="***" --metrics
  4. Relevant log lines:
INF 2020-12-17 08:43:07.002+08:00 Slot end                                   topics="beacnde" tid=2779 file=nimbus_beacon_node.nim:647 slot=111813 nextSlot=111814 head=367807ad:111813 headEpoch=3494 finalizedHead=51dd6350:111744 finalizedEpoch=3492
INF 2020-12-17 08:43:11.034+08:00 Slot start                                 topics="beacnde" tid=2779 file=nimbus_beacon_node.nim:557 lastSlot=111813 scheduledSlot=111814 beaconTime=2w1d12h42m48s34ms280us104ns peers=158 head=367807ad:111813 headEpoch=3494 finalized=51dd6350:111744 finalizedEpoch=3492
 peers: 158 ❯ finalized: 51dd6350:3492 ❯ head: 9abd4fca:3494:6 ❯ time: 3494:6 (111814) ❯ sync: synced                                                                                    ETH: 32.235041625 Traceback (most recent call last, using override)
/home/liangcc/projects/eth2/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@andytudhope
Contributor

andytudhope commented Dec 17, 2020

Can confirm another crash with this issue, on the same OS and architecture as in the comment above. This is also the same machine that crashed with a RangeError. Full versioning info:

Nimbus beacon node v1.0.3-91741326-stateofus
Copyright (c) 2019-2020 Status Research & Development GmbH

eth2 specification v1.0.0

Nim Compiler Version 1.2.6 [Linux: amd64]
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: bf320ed172f74f60fd274338e82bdc9ce3520dd9
active boot switches: -d:release

Logs:

INF 2020-12-16 17:40:35.064+00:00 Slot start                                 topics="beacnde" tid=171473 file=nimbus_beacon_node.nim:557 lastSlot=109700 scheduledSlot=109701 beaconTime=2w1d5h40m12s64ms968us341ns peers=92 head=59dab8d5:109700 headEpoch=3428 finalized=3331cf56:109632 finalizedEpoch=3426
INF 2020-12-16 17:40:43.001+00:00 Slot end                                   topics="beacnde" tid=171473 file=nimbus_beacon_node.nim:647 slot=109701 nextSlot=109702 head=13e1e328:109701 headEpoch=3428 finalizedHead=3331cf56:109632 finalizedEpoch=3426
 peers: 93 ❯ finalized: 3331cf56:3426 ❯ head: 13e1e328:3428:5 ❯ time: 3428:5 (109701) ❯ sync: synced                                                                                                ETH: 32.116382803 Traceback (most recent call last, using override)
/home/ubuntu/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@hylin911
Author

Just happened again with v1.0.3

INF 2020-12-18 03:08:19.001+08:00 Slot end topics="beacnde" tid=1722 file=nimbus_beacon_node.nim:647 slot=117339 nextSlot=117340 head=d3abe649:117339 headEpoch=3666 finalizedHead=78112719:117247 finalizedEpoch=3663
INF 2020-12-18 03:08:23.061+08:00 Slot start topics="beacnde" tid=1722 file=nimbus_beacon_node.nim:557 lastSlot=117339 scheduledSlot=117340 beaconTime=2w2d7h8m61ms8us454ns peers=160 head=d3abe649:117339 headEpoch=3666 finalized=78112719:117247 finalizedEpoch=3663
peers: 160 ❯ finalized: 78112719:3664 ❯ head: 19907233:3666:28 ❯ time: 3666:28 (117340) ❯ sync: synced ETH: 96.729432406 Traceback (most recent call last, using override)
/home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@hylin911
Author

hylin911 commented Jan 9, 2021

Reproduced the issue again with v1.0.4

@hylin911
Author

I just reproduced the issue with v1.0.6 with the following log:

29980 head=945cf5a9:329979 headEpoch=10311 finalizedHead=a60c25e3:329888 finalizedEpoch=10309
INF 2021-01-16 15:56:23.109+08:00 Slot start topics="beacnde" tid=6910 file=nimbus_beacon_node.nim:550 lastSlot=329979 scheduledSlot=329980 beaconTime=6w3d19h56m109ms873us809ns peers=160 head=945cf5a9:329979 headEpoch=10311 finalized=a60c25e3:329888 finalizedEpoch=10309
INF 2021-01-16 15:56:31.002+08:00 Slot end topics="beacnde" tid=6910 file=nimbus_beacon_node.nim:640 slot=329980 nextSlot=329981 head=b688e7c8:329980 headEpoch=10311 finalizedHead=a60c25e3:329888 finalizedEpoch=10309
peers: 160 ❯ finalized: a60c25e3:10309 ❯ head: b688e7c8:10311:28 ❯ time: 10311:28 (329980) ❯ sync: synced ETH: 97.69160784 scripts/run-beacon-node.sh: line 77: 6910 Segmentation fault (core dumped) build/${NBC_BINARY} --network=${NETWORK} --data-dir="${DATA_DIR}" --log-file="${DATA_DIR}/nbc_bn_$(date +"%Y%m%d%H%M%S").log" --web3-url="${WEB3_URL}" --tcp-port=$(( ${BASE_P2P_PORT} + ${NODE_ID} )) --udp-port=$(( ${BASE_P2P_PORT} + ${NODE_ID} )) --rpc --rpc-port=$(( ${BASE_RPC_PORT} +${NODE_ID} )) $@

@sinkingsugar
Contributor

@kdeme I think this might be related to the SIGSEGV we fixed, although the logs are not very helpful. Do you think that's a possibility?

@hylin911
Author

hylin911 commented Feb 7, 2021

My validator crashed with v1.0.7

The error message is different though.

484896 head=2d1fddd2:484895 headEpoch=15152 finalizedHead=68713699:484800 finalizedEpoch=15150
INF 2021-02-07 04:19:35.000+08:00 Slot start topics="beacnde" tid=20018 file=nimbus_beacon_node.nim:603 lastSlot=484895 scheduledSlot=484896 delay=219us55ns peers=160 head=2d1fddd2:484895 headEpoch=15152 finalized=68713699:484800 finalizedEpoch=15150 sync=synced
NOT 2021-02-07 04:19:36.768+08:00 Reached new finalization checkpoint topics="chaindag" tid=20018 file=chain_dag.nim:946 finalizedHead=53a7f3a8:484832 heads=1 newHead=81da483c:484896
INF 2021-02-07 04:19:43.124+08:00 Slot end topics="beacnde" tid=20018 file=nimbus_beacon_node.nim:574 slot=484896 nextSlot=484897 head=81da483c:484896 headEpoch=15153 finalizedHead=53a7f3a8:484832 finalizedEpoch=15151
peers: 160 ❯ finalized: 53a7f3a8:15151 ❯ head: 81da483c:15153:0 ❯ time: 15153:0 (484896) ❯ sync: synced ETH: 98.26583736 Traceback (most recent call last, using override)
/home/hy/nimbus-eth2/vendor/nim-libp2p/libp2p/stream/bufferstream.nim(350) main
/home/hy/nimbus-eth2/vendor/nim-libp2p/libp2p/stream/bufferstream.nim(343) NimMain
/home/hy/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1404) main
/home/hy/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(927) start
/home/hy/nimbus-eth2/???(2) run
/home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: /home/hy/nimbus-eth2/beacon_chain/spec/network.nim(163, 12) newSubnets.len <= attachedValidators.len + 1 [AssertionError]

@tersec
Contributor

tersec commented Feb 7, 2021

@hylin911 please open a new issue. That's not a SIGSEGV at all, and the causes are very different.

That said, that assertion isn't in v1.0.7 at all, according to https://github.com/status-im/nimbus-eth2/blob/v1.0.7/beacon_chain/spec/network.nim. 55ecb61#diff-0858753285dc19b7770d771e748b77d477ba6a312ce4ed36df2603d242cd4b88 / #2240 removed it, three weeks ago.

Specifically what commit are you on, and how did you update to v1.0.7?
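
A claim like "that assertion isn't in v1.0.7" can be checked locally by searching the file as it exists at a given tag instead of in the working tree. The following is only a minimal sketch: the helper name `string_in_rev` is hypothetical, and it assumes a clone in which the tag in question has already been fetched.

```shell
#!/usr/bin/env bash
# string_in_rev REPO REV PATH PATTERN
# Hypothetical helper: exits 0 if PATTERN (a fixed string) occurs in PATH
# as it exists at revision REV, and non-zero otherwise.
string_in_rev() {
  local repo="$1" rev="$2" path="$3" pattern="$4"
  # -F: treat the pattern as a literal string; -q: no output, exit status only.
  git -C "$repo" grep -Fq -e "$pattern" "$rev" -- "$path"
}

# Example call (assumes the tag v1.0.7 exists in the current clone):
# string_in_rev . v1.0.7 beacon_chain/spec/network.nim \
#   'newSubnets.len <= attachedValidators.len + 1' \
#   && echo "assertion present" || echo "assertion absent"
```

If the string is reported as absent at the tag but still shows up in a crash log, the running binary was most likely built from a different commit.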

@hylin911
Author

hylin911 commented Feb 7, 2021

> @hylin911 please open a new issue. That's not a SIGSEGV at all, and the causes are very different.
>
> That said, that assertion isn't in v1.0.7 at all, according to https://github.com/status-im/nimbus-eth2/blob/v1.0.7/beacon_chain/spec/network.nim. 55ecb61#diff-0858753285dc19b7770d771e748b77d477ba6a312ce4ed36df2603d242cd4b88 / #2240 removed it, three weeks ago.
>
> Specifically what commit are you on, and how did you update to v1.0.7?

Yup, I just found that I am on the tag v1.0.6.

I used the command "git pull && make update" and thought I moved to v1.0.7...

I will make another build and test it.

@stefantalpalaru
Contributor

> I used the command "git pull && make update" and thought I moved to v1.0.7...

See what branch you're on, with git status

@hylin911
Author

hylin911 commented Feb 8, 2021

Crashed on v1.0.7 now


hy@hyeth-NUC8i7BEH:~/nimbus-eth2$ git log
commit 596b8c6 (HEAD -> stable, tag: v1.0.7, origin/testing, origin/stable)
Author: Zahary Karadjov zahary@gmail.com
Date: Thu Feb 4 17:28:54 2021 +0200

v1.0.7

490979 head=aa656436:490978 headEpoch=15343 finalizedHead=4d044a6c:490912 finalizedEpoch=15341
INF 2021-02-08 00:36:11.000+08:00 Slot start topics="beacnde" tid=31207 file=nimbus_beacon_node.nim:816 lastSlot=490978 scheduledSlot=490979 delay=195us735ns peers=158 head=aa656436:490978 headEpoch=15343 finalized=4d044a6c:490912 finalizedEpoch=15341 sync=synced
INF 2021-02-08 00:36:19.140+08:00 Slot end topics="beacnde" tid=31207 file=nimbus_beacon_node.nim:787 slot=490979 nextSlot=490980 head=98bb526c:490979 headEpoch=15343 finalizedHead=4d044a6c:490912 finalizedEpoch=15341
INF 2021-02-08 00:36:23.000+08:00 Slot start topics="beacnde" tid=31207 file=nimbus_beacon_node.nim:816 lastSlot=490979 scheduledSlot=490980 delay=870us924ns peers=158 head=98bb526c:490979 headEpoch=15343 finalized=4d044a6c:490912 finalizedEpoch=15341 sync=synced
peers: 158 ❯ finalized: 4d044a6c:15341 ❯ head: 89343078:15343:4 ❯ time: 15343:4 (490980) ❯ sync: synced ETH: 98.278690395 Traceback (most recent call last, using override)
/home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(614) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@hylin911
Author

hylin911 commented Feb 8, 2021

Just another crash

494472 head=b8f943ae:494471 headEpoch=15452 finalizedHead=d755916f:494400 finalizedEpoch=15450
INF 2021-02-08 12:14:47.000+08:00 Slot start topics="beacnde" tid=32682 file=nimbus_beacon_node.nim:816 lastSlot=494471 scheduledSlot=494472 delay=308us583ns peers=159 head=b8f943ae:494471 headEpoch=15452 finalized=d755916f:494400 finalizedEpoch=15450 sync=synced
INF 2021-02-08 12:14:55.062+08:00 Slot end topics="beacnde" tid=32682 file=nimbus_beacon_node.nim:787 slot=494472 nextSlot=494473 head=fb9f8af4:494472 headEpoch=15452 finalizedHead=d755916f:494400 finalizedEpoch=15450
peers: 159 ❯ finalized: d755916f:15450 ❯ head: fb9f8af4:15452:8 ❯ time: 15452:8 (494472) ❯ sync: synced ETH: 98.26972287 Traceback (most recent call last, using override)
whereami error: could not get the program's path on this platform.
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@dryajov
Member

dryajov commented Feb 8, 2021

> > @hylin911 please open a new issue. That's not a SIGSEGV at all, and the causes are very different.
> > That said, that assertion isn't in v1.0.7 at all, according to https://github.com/status-im/nimbus-eth2/blob/v1.0.7/beacon_chain/spec/network.nim. 55ecb61#diff-0858753285dc19b7770d771e748b77d477ba6a312ce4ed36df2603d242cd4b88 / #2240 removed it, three weeks ago.
> > Specifically what commit are you on, and how did you update to v1.0.7?
>
> Yup, I just found that I am on the tag v1.0.6.
>
> I used the command "git pull && make update" and thought I moved to v1.0.7...
>
> I will make another build and test it.

You need to make sure you're either on the stable branch or have checked out the v1.0.7 tag explicitly. I would recommend the latter, as it leaves less room for error.
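
To confirm which release a checkout is actually on before rebuilding, something along these lines could be used. It is a sketch under the assumption that releases are tagged as seen in this thread (v1.0.6, v1.0.7); the helper name `check_tag` is hypothetical and not part of the repository's scripts.

```shell
#!/usr/bin/env bash
# check_tag REPO EXPECTED_TAG
# Hypothetical helper: succeeds only if EXPECTED_TAG points at the current HEAD.
check_tag() {
  local repo_dir="$1" expected_tag="$2"
  # `git tag --points-at HEAD` lists every tag attached to the checked-out commit.
  if git -C "$repo_dir" tag --points-at HEAD | grep -Fxq -- "$expected_tag"; then
    echo "OK: HEAD is at $expected_tag"
  else
    echo "MISMATCH: HEAD is $(git -C "$repo_dir" describe --tags --always), expected $expected_tag" >&2
    return 1
  fi
}

# Example (run inside the clone; `make update` is the command quoted above):
# git fetch --tags && git checkout v1.0.7 && make update
# check_tag . v1.0.7
```

A plain `git pull` only updates the branch that is currently checked out, so a checkout that is detached at an older tag stays where it is until a newer tag or branch is checked out explicitly.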

@stefantalpalaru
Contributor

We could use some GDB backtraces.

If you're using a wrapper script, edit "scripts/run-beacon-node.sh" and change line 78 from exec ${WINPTY} build/${NBC_BINARY} \ to exec gdb --args ${WINPTY} build/${NBC_BINARY} \, then run your script again.

When you're dropped in the GDB prompt, run the r command, wait for the segfault, then run bt to get a backtrace and q to quit. At this point you can copy/paste the backtrace in the GitHub issue and remove the gdb --args part from that script.
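
The script edit described above can also be applied and reverted with sed, which avoids depending on the exact line number (it may differ between versions of the script). This is a minimal sketch: the function names `wrap_with_gdb` and `unwrap_gdb` are hypothetical, and it assumes the launch line still contains the literal text `exec ${WINPTY} build/`.

```shell
#!/usr/bin/env bash
# wrap_with_gdb FILE
# Hypothetical helper: inserts "gdb --args" after "exec" on the launch line.
# Running it twice is harmless, because the pattern no longer matches
# once "gdb --args" has been inserted. A backup is kept as FILE.bak.
wrap_with_gdb() {
  sed -i.bak 's|exec \${WINPTY} build/|exec gdb --args ${WINPTY} build/|' "$1"
}

# unwrap_gdb FILE
# Hypothetical helper: reverses the edit once the backtrace has been collected.
unwrap_gdb() {
  sed -i.bak 's|exec gdb --args \${WINPTY} build/|exec ${WINPTY} build/|' "$1"
}

# Example:
# wrap_with_gdb scripts/run-beacon-node.sh
# ./run-mainnet-beacon-node.sh    # then at the (gdb) prompt: r, wait for the segfault, bt, q
# unwrap_gdb scripts/run-beacon-node.sh
```

The single quotes keep the shell from expanding `${WINPTY}`, so sed sees and rewrites the literal text of the script.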

@stefantalpalaru
Contributor

Are you all using Geth?

@hylin911
Author

> Are you all using Geth?

Yes, I am running Geth on the same machine.

@hylin911
Author

> We could use some GDB backtraces.
>
> If you're using a wrapper script, edit "scripts/run-beacon-node.sh" and change line 78 from exec ${WINPTY} build/${NBC_BINARY} \ to exec gdb --args ${WINPTY} build/${NBC_BINARY} \, then run your script again.
>
> When you're dropped in the GDB prompt, run the r command, wait for the segfault, then run bt to get a backtrace and q to quit. At this point you can copy/paste the backtrace in the GitHub issue and remove the gdb --args part from that script.

Hi @stefantalpalaru ,

I just had time to test v1.0.7 again. It crashed after running for about 3 hours. Below is the backtrace:

Thread 1 "nimbus_beacon_n" received signal SIGSEGV, Segmentation fault.
rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=0x7ffff7c29790, requestedSize=48) at /home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:783
warning: Source file is more recent than executable.
783 c.freeList = c.freeList.next
(gdb) bt
#0 rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=0x7ffff7c29790, requestedSize=48) at /home/hy/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:783
#1 0x0000000000000055 in ?? ()
#2 0x00007ffff7c29728 in ?? ()
#3 0x00007ffff2599218 in ?? ()
#4 0x00005555578e5a40 in ?? ()
#5 0x0000000000000030 in ?? ()
#6 0x00007ffff7c29728 in ?? ()
#7 0x00005555578e5a40 in ?? ()
#8 0x0000000000000000 in ?? ()
(gdb) q
A debugging session is active.

    Inferior 1 [process 77883] will be killed.

Quit anyway? (y or n) y

@hylin911
Author

Hmm... I think I made another build while the problematic one was running. Let me test again to see if I can recover the stack.

@stefantalpalaru
Contributor

Thanks!

Please try a make LOG_LEVEL=DEBUG ... build.

kdeme added a commit that referenced this issue Mar 4, 2021
This should practically solve the segfaults we have been seeing in
issue #2134
zah pushed a commit that referenced this issue Mar 4, 2021
This should practically solve the segfaults we have been seeing in
issue #2134
@kdeme
Contributor

kdeme commented May 12, 2021

No more crashes have been reported since the Nim GC fixes were applied, so I'll close this.

@kdeme kdeme closed this as completed May 12, 2021