-
Notifications
You must be signed in to change notification settings - Fork 649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Witness node crash, block id has trailing zeros #1256
Comments
An address like that makes me think it was some kind of mis-sized pack or unpack. Anything mysterious on the machine? Flaky memory / drive? |
Disk space should be OK. I'm not monitoring memory though, assuming it's OK. The process was not killed due to OOM. |
May be a race condition when a new block comes in while the last one is being processed. |
Another report from telegram:
|
Got 2 more reports from other people today:
Looks like all these are API nodes. |
i restarted a crashed by this issue api node in gdb looking for a backtrace. |
If possible, would you please include environment (versions of bitshares-core, boost, openssl, compiler, active plugins, etc.). |
Mine was Ubuntu 16.04 default build environment.
|
Would you please include steps how to reproduce it or under what circumstances it has happened, host OS and compiler. |
I'm not able to reproduce so far. |
got backtrace for this:
|
So it points to |
It seems to me that we are not doing all operations on the _active_connections collection in the same thread. I believe it has been determined that VERIFY_CORRECT_THREAD is a no-op when compiling without the DEBUG flag. (also, I have not verified that all methods call VERIFY_CORRECT_THREAD even with the DEBUG flag). So either the find() or the end() is returning an object that at the same time has been deleted by another thread. |
I have verified that deleting an item from the collection can cause SIGSEGV when doing a find() ( see https://gist.github.com/jmjatlanta/63cf7a13b9c88d47c9eb63deee5e5118 ). Now the question is: What do we do about it? Do we make sure everyone is on the same thread? Change to a thread-safe list? |
IMHO both are doable, depends on which is easier. |
Thank you for your feedback guys. We will try to reproduce it and investigate further. |
I was experiencing the same issue, here is the log in case it's useful:
|
All research is pointing to bitshares-core/libraries/net/node_impl.hxx Line 277 in 2a175e6
This collection is often iterated over, and is added to / deleted from in a number of places. The single-threaded nature of std::unordered_set makes this collection inappropriate for our implementation. To correct this problem, I believe we have three options. All options have side-effects:
Each time I think that one of these three options is the best option, I find code that makes me scratch my head and say "no, there must be a better way". I do think that a testing framework is good to have, but this comes at a cost, and of course lab conditions are not real-world conditions. The test application I have built so far puts a large load on this collection, and it does segfault, but so far has not pointed me to a red herring that is causing the original issue. I believe this (the lack of a red-herring) is due to the test exercising the node "under the covers" instead of being in a real-world TESTNET-style environment. I am posting this here to give you a "progress report" as well as to ask for your thoughts on the options above, or perhaps to hear of an option I have not thought of. Side note (mainly to remind me): std::unordered_set objects should not be modified, as that could change the hash. But the standard library has a specialization of std::shared_ptr that makes our implementation not susceptible to this issue. |
Good job @jmjatlanta. I feel that the second option is the easiest for fixing this issue. |
It seems not easy to find a perfect unordered set library, so comes another question: do we have to use an unordered set? I assume that usually there are less than 100 connections to maintain, perhaps an ordered set will work just fine? Update: just realized it's probably harder to find an ordered set library. If it's the case, please ignore my comments above.. |
Had another one today, i know is the same but posting so we know how often is happening to me. I tend to think that the most traffic the api node haves the more likely to happen but not totally sure. This is an api node used by open-explorer.io that kind of always crash when cryptofresh is down and the traffic moves to this explorer.
|
I think it's old code, it's there before |
Oh wow, that is old code... Yeah, it may very well predate Does thread #3 actually throw? It doesn't look like it, but that's the only thread doing anything when the core dumped... |
The core dump says:
|
Ahhh, OK, my mistake. So that means, according to the debug build, the crash is here? Can we confirm that the target has the same line 160? If the crash is really on that line, it likely means that |
Yes I think so. |
Have we got any way of reproducing this crash? |
Unassigning myself, as this will require more brainpower than I have. This will probably require (as it has to this point) a team effort. I hope for some big clues that can point us to a true cause. |
@nathanhourt for the issue Gandalf reported above, in order to reproduce it, he tried several times to resync, sometimes the node crashed. For the issue described in OP, I don't know a way to reproduce. |
At least, we need to check |
Reported by @syalon :
|
New report from @syalon (debug build, on
|
I think that's an artifact of the dbg331b branch. It added mutexes around |
Yes, the mutex caused the yielding, so triggered the assertion due to the |
Assuming that this issue is fixed by #1308. Closing. |
Log:
The text was updated successfully, but these errors were encountered: