Micro-forks are causing Segmentation faults #4884
Thanks for the report. Would it be possible to get a full breakdown of the OS,
hardware specifics, a sanitized config.ini, and any other relevant settings?
Given how repeatable this is for those who hit it, and how absent it is for
others, I suspect there is some environmental state or configuration required
to repro.
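For reference, a sanitized config.ini excerpt for this kind of report might look like the sketch below. This is only an illustration: the plugin list, endpoints, and database sizes are placeholders standing in for whatever the affected node actually uses.

# plugins loaded on the affected node (producer / API / consensus-only nodes will differ)
plugin = eosio::chain_plugin
plugin = eosio::net_plugin
plugin = eosio::producer_plugin

# network endpoints (addresses sanitized)
p2p-listen-endpoint = 0.0.0.0:9876
p2p-peer-address = peer1.example.com:9876
http-server-address = 127.0.0.1:8888

# database sizing discussed later in this thread
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 4096   # default is 340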
On Thu, Jul 26, 2018, 12:23 PM Scott Sallinen wrote:
2018-07-26T15:05:23.312 thread-0 producer_plugin.cpp:327 on_incoming_block ] Received block 4d397be19fc69563... #7886077 @ 2018-07-26T15:05:20.000 signed by eosriobrazil [trxs: 1, lib: 7885757, conf: 0, latency: 3312 ms]
2018-07-26T15:05:25.372 thread-0 producer_plugin.cpp:327 on_incoming_block ] Received block d85bb8110e9f9d6f... #7886078 @ 2018-07-26T15:05:24.000 signed by eosswedenorg [trxs: 1005, lib: 7885769, conf: 228, latency: 1372 ms]
2018-07-26T15:05:25.416 thread-0 producer_plugin.cpp:327 on_incoming_block ] Received block 7330b3eafae6d838... #7886079 @ 2018-07-26T15:05:24.500 signed by eosswedenorg [trxs: 0, lib: 7885769, conf: 0, latency: 916 ms]
2018-07-26T15:05:25.422 thread-0 producer_plugin.cpp:327 on_incoming_block ] Received block e1658f16f0e3f94b... #7886080 @ 2018-07-26T15:05:25.000 signed by eosswedenorg [trxs: 1, lib: 7885769, conf: 0, latency: 422 ms]
2018-07-26T15:05:25.455 thread-0 producer_plugin.cpp:327 on_incoming_block ] Received block 48d19d4ae1a2d820... #7886078 @ 2018-07-26T15:05:23.500 signed by eosriobrazil [trxs: 694, lib: 7885769, conf: 0, latency: 1955 ms]
Segmentation fault (core dumped)
As of release 1.1.1, multiple servers (consensus-only, producer, and API)
have recently crashed due to this issue -- it hasn't been a one-time thing,
it seems to be recurrent (and has happened right after a replay).
The behaviour is strange, but consistent. It seems to occur with:
- TX-heavy blocks
- Forking when switching producers, due to latency between them (likely
caused by the TX-heavy blocks)
- Receipt of a block that was already micro-forked off due to the switch
Something isn't being handled correctly in the fork database. There's no
bad_alloc, so it doesn't seem to be an issue with the reversible block db
size (which is also currently set to 4GB, far higher than the default 340MB).
Overflowed TX queuing somewhere?
Since this error has been recurrent for us, I'm happy to try adding debug
flags / gdb / etc. if there are flags that can help you track this issue.
A stack trace would be nice if you can manage to get one with any symbols
in it.
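A rough sketch of how that trace might be captured from a core dump, assuming nodeos was built with debug symbols (e.g. a RelWithDebInfo build) and with the paths below standing in for the real binary and core file locations:

# allow core files in the shell that launches nodeos
ulimit -c unlimited

# after the next "Segmentation fault (core dumped)", open the core in gdb
gdb /path/to/nodeos /path/to/core

# inside gdb: backtrace of the faulting thread, then every thread
(gdb) bt full
(gdb) thread apply all bt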
Some Ubuntu 18.04, some 16.04, and across multiple CPU platforms (i7 7700k and Xeons on bare metal, and Xeon VMs). Other nodes have similar configs (I removed the action-blacklist on that node and it still crashed due to this issue). Is there a good way to get the stack trace printed out for nodeos?
Couldn't seem to get the coredumps from the previous crashes, but I'll run through gdb and see what I can do!
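If the old core dumps are gone, one alternative (a sketch, with illustrative paths and options) is to run nodeos directly under gdb, or attach to the running process, so a backtrace is available the moment it faults:

# run nodeos under gdb from the start
gdb --args /path/to/nodeos --config-dir /path/to/config

# or attach to an already-running instance
gdb -p $(pidof nodeos)

# resume (use "continue" instead of "run" when attached), wait for the SIGSEGV,
# then dump every thread
(gdb) run
(gdb) thread apply all bt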
Here's a backtrace:
Can you get a bt from all threads?
Here's thread 7, the starred one. Let me know if you want a different one.
How about #5?
Thread 5:
Looks like there's some bnet activity in there. Wondering if there's some kind of race condition due to the node mixing both bnet and regular p2p connections.
I notice you are not connecting to any bnet peers. Any idea how many are connecting to this node?
This was from one public seed, which has at least 4 nodes in our private Greymass infrastructure connected to it via bnet (seeds, APIs, and a producer), plus anyone who connects to it via the publicly displayed address.
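For context on that topology, a node accepting both transports would carry something like the sketch below in config.ini; this is an assumption based on the 1.1.x bnet_plugin options, with placeholder endpoints:

plugin = eosio::net_plugin
plugin = eosio::bnet_plugin

# regular p2p transport
p2p-listen-endpoint = 0.0.0.0:9876
p2p-peer-address = peer.example.com:9876

# bnet transport, accepting inbound bnet peers alongside p2p
bnet-endpoint = 0.0.0.0:4321
bnet-connect = internal-seed.example.com:4321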
How much memory do you have on this machine? I see a bad_alloc.
64GB, and the program is only using about 2GB of it.
It seems that sometimes it hits a bad_alloc, and sometimes it segfaults.
So you don't see a memory spike before it dies?
Hm, no, it does not look like there is a memory spike.
Multiple mainnet nodes went down in a single event, it seems. This is my backtrace for this one:
It might just be a coincidence, but the Tokenika node seems to be segfaulting more slowly (it didn't crash overnight) with the history plugin turned off.
I've split up some nodes into different styles of configs, and have so far witnessed the following behaviour:
I will expand my tests to see if allowing multiple connections from the same IP affects things.
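A hedged sketch of the config change that test would exercise, assuming the net_plugin option from this release line (the value shown is illustrative):

# net_plugin: allow more than one connection from the same originating IP
p2p-max-nodes-per-host = 4   # the default is 1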
My API node (running 1.0.7) crashed. The machine has 128G of memory and runs nodeos in Docker on ubuntu:18.04. The state folder was >64G at the time it crashed. I have attached the config.ini: https://pastebin.com/27ihP63k
@dfguo I believe this issue is different -- your crash appears to be due to |
Closed, as this refers to an old version of the code.