This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Micro-forks are causing Segmentation faults #4884

Closed
ScottSallinen opened this issue Jul 26, 2018 · 24 comments

@ScottSallinen
Contributor

2018-07-26T15:05:23.312 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 4d397be19fc69563... #7886077 @ 2018-07-26T15:05:20.000 signed by eosriobrazil [trxs: 1, lib: 7885757, conf: 0, latency: 3312 ms]
2018-07-26T15:05:25.372 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block d85bb8110e9f9d6f... #7886078 @ 2018-07-26T15:05:24.000 signed by eosswedenorg [trxs: 1005, lib: 7885769, conf: 228, latency: 1372 ms]
2018-07-26T15:05:25.416 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 7330b3eafae6d838... #7886079 @ 2018-07-26T15:05:24.500 signed by eosswedenorg [trxs: 0, lib: 7885769, conf: 0, latency: 916 ms]
2018-07-26T15:05:25.422 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block e1658f16f0e3f94b... #7886080 @ 2018-07-26T15:05:25.000 signed by eosswedenorg [trxs: 1, lib: 7885769, conf: 0, latency: 422 ms]
2018-07-26T15:05:25.455 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 48d19d4ae1a2d820... #7886078 @ 2018-07-26T15:05:23.500 signed by eosriobrazil [trxs: 694, lib: 7885769, conf: 0, latency: 1955 ms]
Segmentation fault (core dumped)

As of release 1.1.1, multiple servers (consensus-only, producer, and API) have recently crashed due to this issue -- it hasn't been a one-time thing; it seems to be recurrent (it has happened right after a replay).
Behaviour is strange, but consistent. It seems to occur with:

  • TX-heavy blocks
  • Forking when switching producers due to latency between them (likely caused by the TX-heavy blocks)
  • Receipt of a block that has already been micro-forked off due to the swap

Something isn't being handled correctly in the fork database. There's no bad_alloc, so it doesn't seem to be an issue with the reversible block db size (which is also currently set at 4 GB, far higher than the default 340 MB). Overflowed TX queuing somewhere?
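
For reference, this is roughly what that sizing looks like in config.ini; the option name reversible-blocks-db-size-mb is assumed from the 1.x chain_plugin, and the values just mirror the figures above:

# reversible block database sizing (chain_plugin, config.ini) -- option name assumed
reversible-blocks-db-size-mb = 4096    # our current setting (~4 GB)
# default at the time was 340 (MB)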

Since this error has been recurrent for us, I'm happy to try adding debug flags / gdb / etc. if there are flags that can help you track down this issue.

@wanderingbort
Contributor

wanderingbort commented Jul 26, 2018 via email

@wanderingbort
Contributor

wanderingbort commented Jul 26, 2018 via email

@ScottSallinen
Contributor Author

ScottSallinen commented Jul 26, 2018

Some Ubuntu 18.04, some 16.04, and across multiple CPU platforms (an i7-7700K and Xeons on bare metal, and Xeon VMs).
Config pastebin for one of the nodes: https://pastebin.com/r8hTib1K

Other nodes have similar configs (we removed the action-blacklist on that node and it still crashed due to this issue).

Is there a good way to get the stack trace printed out for nodeos?

@heifner
Contributor

heifner commented Jul 26, 2018

coredumpctl dump nodeos -o core
gdb path_to_eosio/programs/nodeos/nodeos core
bt
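
If no dump was captured for the earlier crashes, something like this first should make sure the next one is (assuming systemd-coredump is installed and nodeos is started from that shell):

ulimit -c unlimited      # allow core files for processes started from this shell
coredumpctl list nodeos  # confirm systemd-coredump is capturing dumps for nodeos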

@ScottSallinen
Contributor Author

Couldn't seem to get the coredumps from the previous crashes, but I'll run through gdb and see what I can do!

@ScottSallinen
Contributor Author

Here's a bt full from my newest crash:
https://pastebin.com/11b6iEFb

@heifner
Contributor

heifner commented Jul 26, 2018

Can you get a bt from all threads?
info threads
thread #
bt
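
If it's easier, a single gdb command should dump every thread in one go:

thread apply all bt full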

@ScottSallinen
Contributor Author

Here's from thread 7, the starred one. Let me know if you want a different one.
https://pastebin.com/WfXveDU9

@heifner
Contributor

heifner commented Jul 26, 2018

How about #5

@ScottSallinen
Contributor Author

Thread 5:
https://pastebin.com/TY1YNeBD

@ScottSallinen
Contributor Author

Looks like there's some stuff with bnet there. Wondering if there's some kind of race condition due to the node mixing both bnet and regular p2p connections.

@heifner
Contributor

heifner commented Jul 26, 2018

I notice you are not connecting to any bnet peers. Any idea how many are connecting to this node?

@ScottSallinen
Contributor Author

This was from one public seed, which has at least 4 nodes from our private Greymass infrastructure (seeds, APIs, and a producer) connected to it via bnet, plus anyone who connects to it via the publicly displayed address.

@heifner
Contributor

heifner commented Jul 26, 2018

How much memory do you have on this machine? I see a bad_alloc.

@ScottSallinen
Contributor Author

64 GB, and the program is only using about 2 GB of it.

@ScottSallinen
Contributor Author

It seems sometimes it bad_alloc's, sometimes it segfaults.

@heifner
Contributor

heifner commented Jul 26, 2018

So you don't see the memory spike before it dies?

@ScottSallinen
Contributor Author

Hm, no, does not look like there is a memory spike.

@ScottSallinen
Contributor Author

It seems multiple mainnet nodes went down in a single event. This is my backtrace for this one:
https://pastebin.com/9gK5sAVv

@perduta

perduta commented Jul 27, 2018

It might be just a coincidence, but the Tokenika node seems to be segfaulting less often (it didn't crash overnight) with the history plugin turned off.
EDIT: I'll turn on systemd-coredump and let you know if we get something.
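
On Ubuntu that is roughly just installing the handler (package name assumed for 18.04); dumps then land under /var/lib/systemd/coredump:

sudo apt install systemd-coredump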

@ScottSallinen
Contributor Author

I've split up some nodes into different styles of configs, and have so far witnessed the following behaviour:

  • bnet only: does not seem to have crashed yet.
  • p2p only: does not seem to have crashed yet.
  • Mixed bnet and p2p: some nodes have crashed.

I will expand my tests to see if allowing multiple connections from the same IP affects things.
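
For concreteness, the three styles differ only in which networking plugins/endpoints are enabled; a rough config.ini sketch (option names assumed from the 1.1.x net_plugin and bnet_plugin) looks like:

# p2p only (net_plugin)
p2p-listen-endpoint = 0.0.0.0:9876
# bnet only (bnet_plugin)
plugin = eosio::bnet_plugin
bnet-endpoint = 0.0.0.0:4321
# mixed: enable both of the above on the same node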

@dfguo

dfguo commented Jul 27, 2018

My API node (running 1.0.7) crashed. The machine has 128 GB of memory and runs nodeos in Docker on ubuntu:18.04. The state folder was >64 GB at the time it crashed.

I have attached the config.ini: https://pastebin.com/27ihP63k

@ScottSallinen
Contributor Author

@dfguo I believe this issue is different -- your crash appears to be due to filter-on=* pushing the state past 65536 MB on the mainnet.
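
That is, with config.ini lines like these (option names assumed from the 1.x history_plugin and chain_plugin), the indexed history grows past the configured state size, which typically surfaces as a bad_alloc rather than this segfault:

filter-on = *                    # history_plugin indexes every action -- grows very quickly on mainnet
chain-state-db-size-mb = 65536   # 64 GB cap on the shared-memory state database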

@halsaphi
Contributor

halsaphi commented Nov 1, 2018

Closed, as this refers to an old version of the code.
Please refer to the latest code and documentation:
https://github.com/EOSIO
https://developers.eos.io/
If the problem persists with the latest version, please raise a new issue.

halsaphi closed this as completed Nov 1, 2018