Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Witness node crash, block id has trailing zeros #1256

Closed
abitmore opened this issue Aug 14, 2018 · 42 comments
Closed

Witness node crash, block id has trailing zeros #1256

abitmore opened this issue Aug 14, 2018 · 42 comments
Labels
3d Bug Classification indicating the existing implementation does not match the intention of the design bug

Comments

@abitmore
Copy link
Member

Log:

1404ms th_a application.cpp:506 handle_block ] Got block: #29608853 01c3cb95bb2461ebed15ad291f5aecaed1b8eeb9 time: 2018-08-14T15:00:00 latency: 1404 ms from: gdex-witness irreversible: 29608833 (-20)
2578ms th_a application.cpp:649 get_item ] Couldn't find block 01b865c6bec836f034d20ba44000000000000000 -- corresponding ID in our chain is 01b865c6bec836f034d20ba4b3302735a2a95601
2578ms th_a application.cpp:560 handle_transaction ] Got 3 transactions from network
Segmentation fault (core dumped)

@abitmore abitmore added bug 3d Bug Classification indicating the existing implementation does not match the intention of the design labels Aug 14, 2018
@jmjatlanta
Copy link
Contributor

An address like that makes me think it was some kind of mis-sized pack or unpack. Anything mysterious on the machine? Flaky memory / drive?

@abitmore
Copy link
Member Author

Disk space should be OK. I'm not monitoring memory though, assuming it's OK. The process was not killed due to OOM.

@pmconrad
Copy link
Contributor

May be a race condition when a new block comes in while the last one is being processed.

@pmconrad
Copy link
Contributor

Another report from telegram:

1647657ms th_a       application.cpp:506           handle_block         ] Got block: #29475418 01c1c25a578d8c74191a321bf7e26e1a04c2de70 time: 2018-08-09T23:27:27 latency: 657 ms from: witness.still  irreversible: 29475397 (-21)
1647758ms th_a       database_api.cpp:282          ~database_api_impl   ] freeing database api 24392467680
1647758ms th_a       database_api.cpp:282          ~database_api_impl   ] freeing database api 25724016352
1647892ms th_a       application.cpp:649           get_item             ] Couldn't find block 01b8667644aba5092640a0dc4000000000000000 -- corresponding ID in our chain is 01b8667644aba5092640a0dc98b447659199f406

@abitmore
Copy link
Member Author

Got 2 more reports from other people today:

1002181ms th_a       application.cpp:506           handle_block         ] Got block: #29720195 01c57e83ec444abb8746675f32044976da4ec9f2 time: 2018-08-18T12:16:42 latency: 181 ms from: witness.yao  irreversible: 29720177 (-18)
1002199ms th_a       application.cpp:560           handle_transaction   ] Got 5 transactions from network
1002208ms th_a       application.cpp:649           get_item             ] Couldn't find block 01b8667644aba5092640a0dc4000000000000000 -- corresponding ID in our chain is 01b8667644aba5092640a0dc98b447659199f406
2937181ms th_a       application.cpp:506           handle_block         ] Got block: #29729186 01c5a1a21e7daa135c2acbdfcde2f6b6ea54fe12 time: 2018-08-18T19:48:57 latency: 181 ms from: witness.yao  irreversible: 29729167 (-19)
2937702ms th_a       application.cpp:560           handle_transaction   ] Got 8 transactions from network
2938001ms th_a       application.cpp:649           get_item             ] Couldn't find block 01b8674e83fe445e3da3bf854000000000000000 -- corresponding ID in our chain is 01b8674e83fe445e3da3bf85266dc16741a4625f
Segmentation fault

Looks like all these are API nodes.

@oxarbitrage
Copy link
Member

i restarted a crashed by this issue api node in gdb looking for a backtrace.

@jmjatlanta
Copy link
Contributor

If possible, would you please include environment (versions of bitshares-core, boost, openssl, compiler, active plugins, etc.).

@abitmore
Copy link
Member Author

abitmore commented Aug 21, 2018

Mine was Ubuntu 16.04 default build environment.

Version: 2.0.180612
SHA: 27d2e3e3df71d937e7ad5f6f60254de8033d72cd
Timestamp: 70 days ago
SSL: OpenSSL 1.0.2g  1 Mar 2016
Boost: 1.58
Websocket++: 0.7.0

@OpenLedgerDev
Copy link

Would you please include steps how to reproduce it or under what circumstances it has happened, host OS and compiler.

@abitmore
Copy link
Member Author

I'm not able to reproduce so far.

@oxarbitrage
Copy link
Member

got backtrace for this:

2292481ms th_a       application.cpp:506           handle_block         ] Got block: #30154763 01cc200b840a1b366dda26555052378dbdcafc8f time: 2018-09-02T15:38:12 latency: 481 ms from: xman  irreversible: 30154744 (-19)
2293018ms th_a       application.cpp:649           get_item             ] Couldn't find block 01b865addb8be51aa6735f0d4000000000000000 -- corresponding ID in our chain is 01b865addb8be51aa6735f0d00a93b2729fff5e8
2293215ms th_a       application.cpp:560           handle_transaction   ] Got 23 transactions from network

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2b94700 (LWP 10592)]
0x00000000014c1304 in std::_Hashtable<std::shared_ptr<graphene::net::peer_connection>, std::shared_ptr<graphene::net::peer_connection>, std::allocator<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Identity, std::equal_to<std::shared_ptr<graphene::net::peer_connection> >, std::hash<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::find (this=this@entry=0xa31ce0a0, __k=<error reading variable: Cannot access memory at address 0x574e5366d4d87>)
    at /usr/include/c++/4.9/bits/hashtable.h:1298
1298        _Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal,
(gdb) bt
#0  0x00000000014c1304 in std::_Hashtable<std::shared_ptr<graphene::net::peer_connection>, std::shared_ptr<graphene::net::peer_connection>, std::allocator<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Identity, std::equal_to<std::shared_ptr<graphene::net::peer_connection> >, std::hash<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::find (this=this@entry=0xa31ce0a0, __k=<error reading variable: Cannot access memory at address 0x574e5366d4d87>)
    at /usr/include/c++/4.9/bits/hashtable.h:1298
#1  0x0000000001498e37 in find (__x=<error reading variable: Cannot access memory at address 0x574e5366d4d87>, this=0xa31ce0a0)
    at /usr/include/c++/4.9/bits/unordered_set.h:547
#2  graphene::net::detail::node_impl::terminate_inactive_connections_loop (this=0xa31cdd70) at /root/repo/pull1103/bitshares-core/libraries/net/node.cpp:985
#3  0x000000000149a91c in operator() (__closure=<optimized out>) at /root/repo/pull1103/bitshares-core/libraries/net/node.cpp:999
#4  fc::detail::void_functor_run<graphene::net::detail::node_impl::terminate_inactive_connections_loop()::<lambda()> >::run(void *, void *) (functor=<optimized out>, 
    prom=0x7fff4a27cc90) at /root/repo/pull1103/bitshares-core/libraries/fc/include/fc/thread/task.hpp:83
#5  0x000000000134f923 in fc::task_base::run_impl (this=0x7fff4a27cca0) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/task.cpp:43
#6  0x000000000134db7f in run_next_task (this=<optimized out>) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:513
#7  fc::thread_d::process_tasks (this=0x7fffec0008c0) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:562
#8  0x000000000134dde1 in fc::thread_d::start_process_tasks (my=140737152813248) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:493
#9  0x000000000159e9b1 in make_fcontext ()
#10 0x0000000000000000 in ?? ()
(gdb) 

@abitmore
Copy link
Member Author

abitmore commented Sep 3, 2018

So it points to terminate_inactive_connections_loop() again, same as #1287.

@jmjatlanta
Copy link
Contributor

It seems to me that we are not doing all operations on the _active_connections collection in the same thread. I believe it has been determined that VERIFY_CORRECT_THREAD is a no-op when compiling without the DEBUG flag. (also, I have not verified that all methods call VERIFY_CORRECT_THREAD even with the DEBUG flag). So either the find() or the end() is returning an object that at the same time has been deleted by another thread.

@jmjatlanta
Copy link
Contributor

jmjatlanta commented Sep 3, 2018

I have verified that deleting an item from the collection can cause SIGSEGV when doing a find() ( see https://gist.github.com/jmjatlanta/63cf7a13b9c88d47c9eb63deee5e5118 ).

Now the question is: What do we do about it? Do we make sure everyone is on the same thread? Change to a thread-safe list?

@abitmore
Copy link
Member Author

abitmore commented Sep 4, 2018

Do we make sure everyone is on the same thread? Change to a thread-safe list?

IMHO both are doable, depends on which is easier.

@abitmore
Copy link
Member Author

abitmore commented Sep 5, 2018

Probably related: #402, #885. With debug build, there will be assertion failures; but all assertions are dropped in release build.

@OpenLedgerApp
Copy link
Contributor

Thank you for your feedback guys. We will try to reproduce it and investigate further.

@HarukaMa
Copy link
Contributor

HarukaMa commented Sep 6, 2018

I was experiencing the same issue, here is the log in case it's useful:

296200ms th_a       database_api.cpp:292          database_api_impl    ] creating database api 11907916016
297063ms th_a       application.cpp:565           handle_transaction   ] Got 9 transactions from network
297349ms th_a       application.cpp:511           handle_block         ] Got block: #30261662 01cdc19e5909c3dca002d3804f19920bc7c67af2 time: 2018-09-06T09:04:57 latency: 349 ms from: in.abit  irreversible: 30261635 (-27)
297664ms th_a       application.cpp:511           handle_block         ] Got block: #30261662 01cdc19e5909c3dca002d3804f19920bc7c67af2 time: 2018-09-06T09:04:57 latency: 664 ms from: in.abit  irreversible: 30261636 (-26)
297778ms th_a       application.cpp:654           get_item             ] Couldn't find block 01b865addb8be51aa6735f0d4000000000000000 -- corresponding ID in our chain is 01b865addb8be51aa6735f0d00a93b2729fff5e8
[1]    11869 bus error (core dumped)  programs/witness_node/witness_node

@jmjatlanta
Copy link
Contributor

jmjatlanta commented Sep 6, 2018

All research is pointing to

std::unordered_set<graphene::net::peer_connection_ptr> _active_connections;
being accessed by more than one thread.

This collection is often iterated over, and is added to / deleted from in a number of places. The single-threaded nature of std::unordered_set makes this collection inappropriate for our implementation.

To correct this problem, I believe we have three options. All options have side-effects:

  • Assure that all calls to functions that read from and write to this collection are on one thread.
    • Assuring that all calls happen on the same thread was evidently how it was supposed to work at the start. Finding what changed and preventing another thread from modifying the collection would fix this problem, but does not future-proof the code from another modification in the future ( another change, and we'll be back in the same boat ).
  • Use a collection that is thread-safe
    • At first glance, this is an easy fix. Change from std::unordered_set to a thread-safe implementation and we are done. But the many instances of iteration in our own code would probably put this collection in lock contention quite often.
  • Implement locking whenever the collection is read from / written to.
    • This solution is cumbersome. A quick search shows 27 places where this collection is being iterated over. Add that to the list of places where the collection is searched, added to, or deleted from and there will be many places to be evaluated/edited/refactored. Each location will need to be evaluated to determine the best way to include locking. Of course, some of these places may be refactored to a few "standard" locking scenarios, so such refactoring may provide some maintenance benefits. But the task is not a small one.

Each time I think that one of these three options is the best option, I find code that makes me scratch my head and say "no, there must be a better way".

I do think that a testing framework is good to have, but this comes at a cost, and of course lab conditions are not real-world conditions. The test application I have built so far puts a large load on this collection, and it does segfault, but so far has not pointed me to a red herring that is causing the original issue. I believe this (the lack of a red-herring) is due to the test exercising the node "under the covers" instead of being in a real-world TESTNET-style environment.

I am posting this here to give you a "progress report" as well as to ask for your thoughts on the options above, or perhaps to hear of an option I have not thought of.

Side note (mainly to remind me): std::unordered_set objects should not be modified, as that could change the hash. But the standard library has a specialization of std::shared_ptr that makes our implementation not susceptible to this issue.

@abitmore
Copy link
Member Author

abitmore commented Sep 6, 2018

Good job @jmjatlanta. I feel that the second option is the easiest for fixing this issue.

@abitmore
Copy link
Member Author

abitmore commented Sep 10, 2018

It seems not easy to find a perfect unordered set library, so comes another question: do we have to use an unordered set? I assume that usually there are less than 100 connections to maintain, perhaps an ordered set will work just fine?

Update: just realized it's probably harder to find an ordered set library. If it's the case, please ignore my comments above..

@oxarbitrage
Copy link
Member

Had another one today, i know is the same but posting so we know how often is happening to me.

I tend to think that the most traffic the api node haves the more likely to happen but not totally sure. This is an api node used by open-explorer.io that kind of always crash when cryptofresh is down and the traffic moves to this explorer.

1871453ms th_a       application.cpp:560           handle_transaction   ] Got 28 transactions from network
1872384ms th_a       application.cpp:506           handle_block         ] Got block: #30427109 01d047e50335ca1c685d5a696efa91f0aa0e6be0 time: 2018-09-12T03:31:12 latency: 384 ms from: sahkan-bitshares  irreversible: 30427081 (-28)
1873039ms th_a       application.cpp:649           get_item             ] Couldn't find block 01b865ac0903a3b6f938ddd84000000000000000 -- corresponding ID in our chain is 01b865ac0903a3b6f938ddd80511686dd7eb0b62
1873039ms th_a       application.cpp:506           handle_block         ] Got block: #30427109 01d047e50335ca1c685d5a696efa91f0aa0e6be0 time: 2018-09-12T03:31:12 latency: 1039 ms from: sahkan-bitshares  irreversible: 30427085 (-24)
1873113ms th_a       application.cpp:560           handle_transaction   ] Got 18 transactions from network

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2b94700 (LWP 26570)]
0x00000000014c1304 in std::_Hashtable<std::shared_ptr<graphene::net::peer_connection>, std::shared_ptr<graphene::net::peer_connection>, std::allocator<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Identity, std::equal_to<std::shared_ptr<graphene::net::peer_connection> >, std::hash<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::find (this=this@entry=0xa31d79d0, __k=<error reading variable: Cannot access memory at address 0x575a438e2fd54>)
    at /usr/include/c++/4.9/bits/hashtable.h:1298
1298        _Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal,
(gdb) bt
#0  0x00000000014c1304 in std::_Hashtable<std::shared_ptr<graphene::net::peer_connection>, std::shared_ptr<graphene::net::peer_connection>, std::allocator<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Identity, std::equal_to<std::shared_ptr<graphene::net::peer_connection> >, std::hash<std::shared_ptr<graphene::net::peer_connection> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::find (this=this@entry=0xa31d79d0, __k=<error reading variable: Cannot access memory at address 0x575a438e2fd54>)
    at /usr/include/c++/4.9/bits/hashtable.h:1298
#1  0x0000000001498e37 in find (__x=<error reading variable: Cannot access memory at address 0x575a438e2fd54>, this=0xa31d79d0)
    at /usr/include/c++/4.9/bits/unordered_set.h:547
#2  graphene::net::detail::node_impl::terminate_inactive_connections_loop (this=0xa31d76a0) at /root/repo/pull1103/bitshares-core/libraries/net/node.cpp:985
#3  0x000000000149a91c in operator() (__closure=<optimized out>) at /root/repo/pull1103/bitshares-core/libraries/net/node.cpp:999
#4  fc::detail::void_functor_run<graphene::net::detail::node_impl::terminate_inactive_connections_loop()::<lambda()> >::run(void *, void *) (functor=<optimized out>, 
    prom=0x7fff47d16200) at /root/repo/pull1103/bitshares-core/libraries/fc/include/fc/thread/task.hpp:83
#5  0x000000000134f923 in fc::task_base::run_impl (this=0x7fff47d16210) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/task.cpp:43
#6  0x000000000134db7f in run_next_task (this=<optimized out>) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:513
#7  fc::thread_d::process_tasks (this=0x7fffec0008c0) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:562
#8  0x000000000134dde1 in fc::thread_d::start_process_tasks (my=140737152813248) at /root/repo/pull1103/bitshares-core/libraries/fc/src/thread/thread_d.hpp:493
#9  0x000000000159e9b1 in make_fcontext ()
#10 0x0000000000000000 in ?? ()
(gdb) 

@abitmore
Copy link
Member Author

abitmore commented Apr 6, 2019

clear_expired_transactions is a bit unusual in that it operates directly on a mutable index instead of using the normal database::remove call. Any idea why?

I think it's old code, it's there before database::remove() etc were in place. @nathanhourt do you know?

@nathanielhourt
Copy link
Contributor

Oh wow, that is old code... Yeah, it may very well predate database::remove() -- the only reason I can think of to bypass it otherwise is to avoid creating undo data. Logically it would predate it, since transaction handling has to get written before operations, evaluators, and objects.

Does thread #3 actually throw? It doesn't look like it, but that's the only thread doing anything when the core dumped...

@abitmore
Copy link
Member Author

abitmore commented Apr 7, 2019

The core dump says:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000560e0537ec9f in graphene::net::detail::message_oriented_connection_impl::read_loop() ()
[Current thread is 1 (Thread 0x7f08ccff9700 (LWP 10144))]

@nathanielhourt
Copy link
Contributor

nathanielhourt commented Apr 9, 2019

Ahhh, OK, my mistake. So that means, according to the debug build, the crash is here? Can we confirm that the target has the same line 160? If the crash is really on that line, it likely means that this has been deleted, or the stack has been munged.

@abitmore
Copy link
Member Author

Ahhh, OK, my mistake. So that means, according to the debug build, the crash is here? Can we confirm that the target has the same line 160? If the crash is really on that line, it likely means that this has been deleted, or the stack has been munged.

Yes I think so.

@nathanielhourt
Copy link
Contributor

Have we got any way of reproducing this crash?

@jmjatlanta jmjatlanta removed their assignment Apr 11, 2019
@jmjatlanta
Copy link
Contributor

Unassigning myself, as this will require more brainpower than I have. This will probably require (as it has to this point) a team effort. I hope for some big clues that can point us to a true cause.

@abitmore
Copy link
Member Author

@nathanhourt for the issue Gandalf reported above, in order to reproduce it, he tried several times to resync, sometimes the node crashed.

For the issue described in OP, I don't know a way to reproduce.

@abitmore
Copy link
Member Author

abitmore commented Jul 5, 2019

VERIFY_CORRECT_THREAD() is defined in several files, but the one defined in node.cpp is different, IMHO it's a bug.

#ifdef P2P_IN_DEDICATED_THREAD
# define VERIFY_CORRECT_THREAD() assert(_thread->is_current())
#else
# define VERIFY_CORRECT_THREAD() do {} while (0)
#endif

At least, we need to check #ifndef NDEBUG like elsewhere.

@abitmore
Copy link
Member Author

abitmore commented Sep 16, 2019

Reported by @syalon :

429917ms th_a application.cpp:631 handle_transaction ] Got 8 transactions from network
witness_node: /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp:393: bool fc::thread_d::start_next_fiber(bool): Assertion `std::current_exception() == std::exception_ptr()' failed.

(gdb) bt
#0  0x00007f21263be428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f21263c002a in __GI_abort () at abort.c:89
#2  0x00007f21263b6bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x66514f8 "std::current_exception() == std::exception_ptr()",
    file=file@entry=0x66513d0 "/usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp", line=line@entry=393,
    function=function@entry=0x6652200 <fc::thread_d::start_next_fiber(bool)::__PRETTY_FUNCTION__> "bool fc::thread_d::start_next_fiber(bool)") at assert.c:92
#3  0x00007f21263b6c82 in __GI___assert_fail (assertion=0x66514f8 "std::current_exception() == std::exception_ptr()",
    file=0x66513d0 "/usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp", line=393,
    function=0x6652200 <fc::thread_d::start_next_fiber(bool)::__PRETTY_FUNCTION__> "bool fc::thread_d::start_next_fiber(bool)") at assert.c:101
#4  0x0000000003681ba0 in fc::thread_d::start_next_fiber (this=0x8c9c960, reschedule=true) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp:393
#5  0x000000000367a8ef in fc::thread::yield (this=0x8bf6cf0, reschedule=true) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread.cpp:280
#6  0x000000000367b835 in fc::yield () at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread.cpp:366
#7  0x00000000036919db in fc::spin_yield_lock::lock (this=0x7f20a43bbe3c) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/spin_yield_lock.cpp:41
#8  0x000000000368fc32 in fc::unique_lock<fc::spin_yield_lock&>::lock (this=0x7f20925ffb20) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/include/fc/thread/unique_lock.hpp:21
#9  0x000000000368fb43 in fc::unique_lock<fc::spin_yield_lock&>::unique_lock (this=0x7f20925ffb20, l=...) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/include/fc/thread/unique_lock.hpp:17
#10 0x000000000368f805 in fc::promise_base::_set_value (this=0x7f20a43bbe20, s=0x0) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/future.cpp:137
#11 0x000000000368ef9a in fc::promise_base::set_exception (this=0x7f20a43bbe20, e=std::shared_ptr (count 2, weak 0) 0x9308fd0) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/future.cpp:47
#12 0x000000000369053a in fc::task_base::run_impl (this=0x7f20a43bbd80) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/task.cpp:56
#13 0x00000000036900b0 in fc::task_base::run (this=0x7f20a43bbd80) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/task.cpp:33
#14 0x0000000003682856 in fc::thread_d::run_next_task (this=0x8c9c960) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp:534
#15 0x0000000003682dd9 in fc::thread_d::process_tasks (this=0x8c9c960) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp:583
#16 0x00000000036822f9 in fc::thread_d::start_process_tasks (my=147442016) at /usr/btspp/dbg331b/bitshares-core/libraries/fc/src/thread/thread_d.hpp:514
#17 0x0000000003fbbf91 in make_fcontext ()
#18 0x0000000000000000 in ?? ()

@abitmore
Copy link
Member Author

New report from @syalon (debug build, on dbg331b branch (075cd44) with the bitshares/bitshares-fc#165 patch (bitshares/bitshares-fc@f4179b4) applied ):

witness_node: /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp:386: bool fc::thread_d::start_next_fiber(bool): Assertion `non_preemptable_scope_count == 0' failed.
Aborted


Thread 18 "p2p" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fa1a8be2700 (LWP 2919)]
0x00007fa1c7af8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fa1c7af8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fa1c7afa02a in __GI_abort () at abort.c:89
#2  0x00007fa1c7af0bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x6653d80 "non_preemptable_scope_count == 0",
    file=file@entry=0x6653c70 "/usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp", line=line@entry=386,
    function=function@entry=0x6654b20 <fc::thread_d::start_next_fiber(bool)::__PRETTY_FUNCTION__> "bool fc::thread_d::start_next_fiber(bool)") at assert.c:92
#3  0x00007fa1c7af0c82 in __GI___assert_fail (assertion=0x6653d80 "non_preemptable_scope_count == 0",
    file=0x6653c70 "/usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp", line=386,
    function=0x6654b20 <fc::thread_d::start_next_fiber(bool)::__PRETTY_FUNCTION__> "bool fc::thread_d::start_next_fiber(bool)") at assert.c:101
#4  0x00000000036825d2 in fc::thread_d::start_next_fiber (this=0x7fa0354aee80, reschedule=false) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp:386
#5  0x000000000367b36f in fc::thread::yield (this=0x2ba284b0, reschedule=false) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread.cpp:280
#6  0x0000000003f52952 in fc::mutex::lock (this=0x2baa8f98) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/mutex.cpp:177
#7  0x00000000038fa1a8 in fc::scoped_lock<fc::mutex>::scoped_lock (this=0x7fa1a83e1a30, l=...) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/include/fc/thread/scoped_lock.hpp:7
#8  0x00000000038769e5 in graphene::net::detail::node_impl::fetch_sync_items_loop (this=0x2baa8b90) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/net/node.cpp:539
#9  0x00000000038c4647 in graphene::net::detail::node_impl::<lambda()>::operator()(void) const (__closure=0x7fa03550ff20) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/net/node.cpp:4276
#10 0x00000000038eda96 in fc::detail::void_functor_run<graphene::net::detail::node_impl::connect_to_p2p_network()::<lambda()> >::run(void *, void *) (functor=0x7fa03550ff20, prom=0x7fa03550ff18)
    at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/include/fc/thread/task.hpp:109
#11 0x00000000036917ba in fc::task_base::run_impl (this=0x7fa03550fea0) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/task.cpp:44
#12 0x000000000369173a in fc::task_base::run (this=0x7fa03550fea0) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/task.cpp:33
#13 0x0000000003683876 in fc::thread_d::run_next_task (this=0x7fa0354aee80) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp:543
#14 0x0000000003683df9 in fc::thread_d::process_tasks (this=0x7fa0354aee80) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp:592
#15 0x0000000003683319 in fc::thread_d::start_process_tasks (my=140326065598080) at /usr/btspp/dbg331bwith_fcpr_debug/bitshares-core/libraries/fc/src/thread/thread_d.hpp:523
#16 0x0000000003fbd5b1 in make_fcontext ()
#17 0x0000000000000000 in ?? ()

@pmconrad
Copy link
Contributor

I think that's an artifact of the dbg331b branch. It added mutexes around _active_connections loops, and one such loop was inside an ASSERT_NOT_PREEMPTED block. So when the task tried to lock the mutex it yielded and that fired the assert.

@abitmore
Copy link
Member Author

abitmore commented Sep 19, 2019

Yes, the mutex caused the yielding, so triggered the assertion due to the ASSERT_NOT_PREEMPTED. However, without the mutex, there might be 2 tasks accessing _active_connection at the same time which may lead to other issues. IMO we should accept yielding and try to keep the program work correctly when one task yields. I removed ASSERT_NOT_PREEMPTED from dbg331b branch in a later commit.

@abitmore
Copy link
Member Author

abitmore commented Feb 2, 2020

Assuming that this issue is fixed by #1308. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3d Bug Classification indicating the existing implementation does not match the intention of the design bug
Projects
None yet
Development

No branches or pull requests

8 participants