This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Micro-forks are causing Segmentation faults #4884

Closed
ScottSallinen opened this issue Jul 26, 2018 · 24 comments

@ScottSallinen
Contributor

2018-07-26T15:05:23.312 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 4d397be19fc69563... #7886077 @ 2018-07-26T15:05:20.000 signed by eosriobrazil [trxs: 1, lib: 7885757, conf: 0, latency: 3312 ms]
2018-07-26T15:05:25.372 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block d85bb8110e9f9d6f... #7886078 @ 2018-07-26T15:05:24.000 signed by eosswedenorg [trxs: 1005, lib: 7885769, conf: 228, latency: 1372 ms]
2018-07-26T15:05:25.416 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 7330b3eafae6d838... #7886079 @ 2018-07-26T15:05:24.500 signed by eosswedenorg [trxs: 0, lib: 7885769, conf: 0, latency: 916 ms]
2018-07-26T15:05:25.422 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block e1658f16f0e3f94b... #7886080 @ 2018-07-26T15:05:25.000 signed by eosswedenorg [trxs: 1, lib: 7885769, conf: 0, latency: 422 ms]
2018-07-26T15:05:25.455 thread-0   producer_plugin.cpp:327       on_incoming_block    ] Received block 48d19d4ae1a2d820... #7886078 @ 2018-07-26T15:05:23.500 signed by eosriobrazil [trxs: 694, lib: 7885769, conf: 0, latency: 1955 ms]
Segmentation fault (core dumped)

As of release 1.1.1, multiple servers (consensus-only, producer, and API) have recently crashed due to this issue -- it hasn't been a one-time thing; it seems to be recurrent (it has happened right after a replay).
Behaviour is strange, but consistent. It seems to occur with:

  • TX-heavy blocks
  • Forking when switching producers due to latency between them (likely caused by the TX-heavy blocks)
  • Receipt of a block that has already been micro-forked off due to the swap

Something isn't being handled correctly in the fork database. There's no bad_alloc, so it doesn't seem to be an issue with the reversible block db size (which is also currently set at 4 GB, far higher than the default 340 MB). Overflowed TX queuing somewhere?
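
For reference, this is roughly what that sizing looks like in config.ini; the option name reversible-blocks-db-size-mb is assumed from the 1.x chain_plugin, and the values just mirror the figures above:

# reversible block database sizing (chain_plugin, config.ini) -- option name assumed
reversible-blocks-db-size-mb = 4096    # our current setting (~4 GB)
# default at the time was 340 (MB)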

Since this error has been recurrent for us, I'm happy to try adding debug flags / gdb / etc. if there are flags that can help you track down this issue.

@wanderingbort
Contributor

wanderingbort commented Jul 26, 2018 via email

@wanderingbort
Contributor

wanderingbort commented Jul 26, 2018 via email

@ScottSallinen
Contributor Author

ScottSallinen commented Jul 26, 2018

Some Ubuntu 18.04, some 16.04, and across multiple CPU platforms (an i7-7700K and Xeons on bare metal, and Xeon VMs).
Config pastebin for one of the nodes: https://pastebin.com/r8hTib1K

Other nodes have similar configs (we removed the action-blacklist on that node and it still crashed due to this issue).

Is there a good way to get the stack trace printed out for nodeos?

@heifner
Contributor

heifner commented Jul 26, 2018

coredumpctl dump nodeos -o core
gdb path_to_eosio/programs/nodeos/nodeos core
bt
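
If no dump was captured for the earlier crashes, something like this first should make sure the next one is (assuming systemd-coredump is installed and nodeos is started from that shell):

ulimit -c unlimited      # allow core files for processes started from this shell
coredumpctl list nodeos  # confirm systemd-coredump is capturing dumps for nodeos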

@ScottSallinen
Contributor Author

Couldn't seem to get the coredumps from the previous crashes, but I'll run through gdb and see what I can do!

@ScottSallinen
Contributor Author

Here's a bt full from my newest crash:
https://pastebin.com/11b6iEFb

@heifner
Contributor

heifner commented Jul 26, 2018

Can you get a bt from all threads?
info threads
thread #
bt
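
If it's easier, a single gdb command should dump every thread in one go:

thread apply all bt full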

@ScottSallinen
Contributor Author

Here's from thread 7, the starred one. Let me know if you want a different one.
https://pastebin.com/WfXveDU9

@heifner
Contributor

heifner commented Jul 26, 2018

How about #5

@ScottSallinen
Contributor Author

Thread 5:
https://pastebin.com/TY1YNeBD

@ScottSallinen
Contributor Author

Looks like there's some stuff with bnet there. Wondering if there's some kind of race condition due to the node mixing both bnet and regular p2p connections.

@heifner
Contributor

heifner commented Jul 26, 2018

I notice you are not connecting to any bnet peers. Any idea how many are connecting to this node?

@ScottSallinen
Contributor Author

This was from one public seed, which has at least 4 nodes from our private Greymass infrastructure (seeds, APIs, and a producer) connected to it via bnet, plus anyone who connects to it via the publicly displayed address.

@heifner
Contributor

heifner commented Jul 26, 2018

How much memory do you have on this machine? I see a bad_alloc.

@ScottSallinen
Contributor Author

64 GB, and the program is only using about 2 GB of it.

@ScottSallinen
Contributor Author

It seems sometimes it bad_alloc's, sometimes it segfaults.

@heifner
Contributor

heifner commented Jul 26, 2018

So you don't see the memory spike before it dies?

@ScottSallinen
Contributor Author

Hm, no, does not look like there is a memory spike.

@ScottSallinen
Contributor Author

It seems multiple mainnet nodes went down in a single event. This is my backtrace for this one:
https://pastebin.com/9gK5sAVv

@perduta

perduta commented Jul 27, 2018

It might be just a coincidence, but the Tokenika node seems to be segfaulting less often (it didn't crash overnight) with the history plugin turned off.
EDIT: I'll turn on systemd-coredump and let you know if we get something.
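
On Ubuntu that is roughly just installing the handler (package name assumed for 18.04); dumps then land under /var/lib/systemd/coredump:

sudo apt install systemd-coredump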

@ScottSallinen
Contributor Author

I've split up some nodes into different styles of configs, and have so far witnessed the following behaviour:

  • bnet only: does not seem to have crashed yet.
  • p2p only: does not seem to have crashed yet.
  • Mixed bnet and p2p: some nodes have crashed.

I will expand my tests to see if allowing multiple connections from the same IP affects things.
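
For concreteness, the three styles differ only in which networking plugins/endpoints are enabled; a rough config.ini sketch (option names assumed from the 1.1.x net_plugin and bnet_plugin) looks like:

# p2p only (net_plugin)
p2p-listen-endpoint = 0.0.0.0:9876
# bnet only (bnet_plugin)
plugin = eosio::bnet_plugin
bnet-endpoint = 0.0.0.0:4321
# mixed: enable both of the above on the same node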

@dfguo

dfguo commented Jul 27, 2018

My API node (running 1.0.7) crashed. The machine has 128 GB of memory and runs nodeos in Docker on ubuntu:18.04. The state folder was >64 GB at the time it crashed.

I have attached the config.ini: https://pastebin.com/27ihP63k

@ScottSallinen
Contributor Author

@dfguo I believe this issue is different -- your crash appears to be due to filter-on=* pushing the state past 65536 MB on the mainnet.
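
That is, with config.ini lines like these (option names assumed from the 1.x history_plugin and chain_plugin), the indexed history grows past the configured state size, which typically surfaces as a bad_alloc rather than this segfault:

filter-on = *                    # history_plugin indexes every action -- grows very quickly on mainnet
chain-state-db-size-mb = 65536   # 64 GB cap on the shared-memory state database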

@halsaphi
Contributor

halsaphi commented Nov 1, 2018

Closed, as this refers to an old version of the code.
Please refer to the latest code and documentation:
https://github.com/EOSIO
https://developers.eos.io/
If the problem persists with the latest version, please raise a new issue.

halsaphi closed this as completed Nov 1, 2018