Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-4.0.3 witness_node CPU 100% #2078

Closed
1 of 17 tasks
abitmore opened this issue Dec 9, 2019 · 8 comments
Closed
1 of 17 tasks

test-4.0.3 witness_node CPU 100% #2078

abitmore opened this issue Dec 9, 2019 · 8 comments
Assignees
Labels
2f Testing Status indicating the solution is being evaluated against the test case(s) 3d Bug Classification indicating the existing implementation does not match the intention of the design 6 Performance Impacts flag identifying system/user efficiency, performance, etc. 9a Tiny Effort estimation indicating TBD

Comments

@abitmore
Copy link
Member

abitmore commented Dec 9, 2019

Bug Description
After started witness_node for a while, its CPU went to 100% .

From my stack backtrace, it seems it's either related to log rotation (thread 1) or task switching (thread 18, seems potentially a stack overflow), or both. Note: after interrupted the process in gdb then continued, thread 18 became normal.

Thread 19 (Thread 0x7fffa7ffc700 (LWP 12844)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2cdc2 in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#2  0x0000000004d2fb4a in boost_asio_detail_posix_thread_function ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffa7ffc700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 18 (Thread 0x7fffc0ff9700 (LWP 12843)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000000004d201ed in fc::thread_d::process_tasks() ()
#2  0x0000000004d2045c in fc::thread_d::start_process_tasks(long) ()
#3  0x0000000005aacc21 in make_fcontext ()
#4  0x0000000000006400 in ?? ()
#5  0x0000000000200015 in ?? ()
#6  0x00007fffb36d6f50 in ?? ()
#7  0x00007fffa3838670 in ?? ()
#8  0x00007fffb36d6f50 in ?? ()
#9  0x00007fffa3838670 in ?? ()
#10 0x489acb9eeb42a011 in ?? ()
#11 0x2f200003dc99ce59 in ?? ()
#12 0xa017001909efd43e in ?? ()
#13 0x0f812d604c0da830 in ?? ()
#14 0x0f15fc8dcb66aaed in ?? ()
#15 0x43717695f48c18dd in ?? ()
#16 0xe5535668bc91a2c9 in ?? ()
#17 0xe65f93f762448a2b in ?? ()
#18 0xa307cc747cdc4eb6 in ?? ()
#19 0x8bc59ec84c885349 in ?? ()
#20 0x51b031e091f88d1c in ?? ()
#21 0x000064000158c95c in ?? ()
#22 0x079f000000000000 in ?? ()
#23 0x00000000000107a0 in ?? ()
#24 0x1f01000000000000 in ?? ()
#25 0x7fa6a19d82ec4318 in ?? ()
#26 0xff7bdcf6ab44ebd7 in ?? ()
#27 0x77d210ea2b516dbd in ?? ()
#28 0xe9e18507df42c828 in ?? ()
#29 0x4adcf2a112113e1e in ?? ()
#30 0x881338f4715a5c6e in ?? ()
#31 0x15c52145ac7b7c56 in ?? ()
#32 0xcadba807fb4a618c in ?? ()
#33 0xb031e091f88d0001 in ?? ()
#34 0x0064000158c95c53 in ?? ()
#35 0x9f00000000000000 in ?? ()
#36 0x000000000107a007 in ?? ()
#37 0x0100000000000000 in ?? ()
#38 0x21a784a8222f031f in ?? ()
#39 0xcb37035fdbf3fb00 in ?? ()
#40 0x3cf5c8e8193b331b in ?? ()
#41 0x4f33038550ad8ebb in ?? ()
#42 0x1f12f053c78d2b69 in ?? ()
#43 0xf0ead82ec71efcac in ?? ()
#44 0xa88742e34233c204 in ?? ()
#45 0xbeed3177ce2d4da3 in ?? ()
#46 0x31e091f88d00010d in ?? ()
#47 0x64000158c95c52b0 in ?? ()
#48 0x0000000000000000 in ?? ()

Thread 17 (Thread 0x7fffc17fa700 (LWP 12842)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc17fa700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 16 (Thread 0x7fffc1ffb700 (LWP 12841)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc1ffb700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 15 (Thread 0x7fffc27fc700 (LWP 12840)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc27fc700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 14 (Thread 0x7fffc2ffd700 (LWP 12839)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc2ffd700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 13 (Thread 0x7fffc37fe700 (LWP 12838)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc37fe700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 12 (Thread 0x7fffc3fff700 (LWP 12837)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffc3fff700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 11 (Thread 0x7fffe0ff9700 (LWP 12836)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffe0ff9700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 10 (Thread 0x7fffe17fa700 (LWP 12835)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d2028d in fc::thread_d::process_tasks() ()
#2  0x0000000004d175c0 in fc::thread::exec() ()
#3  0x0000000004d1955f in fc::thread::thread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, fc::thread_idle_notifier*)::{lambda()#1}::operator()() const [clone .constprop.370] ()
#4  0x0000000005a5cfe5 in thread_proxy ()
#5  0x00007ffff79a76ba in start_thread (arg=0x7fffe17fa700) at pthread_create.c:333
#6  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x7fffe1ffb700 (LWP 12834)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffe1ffb700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 8 (Thread 0x7fffe27fc700 (LWP 12833)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffe27fc700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 7 (Thread 0x7fffe2ffd700 (LWP 12832)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffe2ffd700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7fffe37fe700 (LWP 12831)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffe37fe700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 5 (Thread 0x7fffe3fff700 (LWP 12830)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7fffe3fff700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 4 (Thread 0x7ffff0a47700 (LWP 12829)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7ffff0a47700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 3 (Thread 0x7ffff1248700 (LWP 12828)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000004d27a95 in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7ffff1248700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 2 (Thread 0x7ffff1a49700 (LWP 12827)):
#0  0x00007ffff67c1a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000004d276dc in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const [clone .isra.290] [clone .constprop.426] ()
#2  0x0000000005a5cfe5 in thread_proxy ()
#3  0x00007ffff79a76ba in start_thread (arg=0x7ffff1a49700) at pthread_create.c:333
#4  0x00007ffff67c141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 1 (Thread 0x7ffff7fd8780 (LWP 12826)):
#0  0x00007ffff67828eb in __getdents (fd=23, buf=buf@entry=0x838de20 "?\200\t", nbytes=32768) at ../sysdeps/unix/sysv/linux/getdents.c:111
#1  0x00007ffff67825f4 in __readdir_r (dirp=0x838ddf0, entry=0x8395e30, result=0x6a58870) at ../sysdeps/posix/readdir_r.c:63
#2  0x0000000005a6a5c1 in boost::filesystem::detail::directory_iterator_increment(boost::filesystem::directory_iterator&, boost::system::error_code*) ()
#3  0x0000000005a6ee6a in boost::filesystem::detail::directory_iterator_construct(boost::filesystem::directory_iterator&, boost::filesystem::path const&, boost::system::error_code*) ()
#4  0x0000000004d6e488 in fc::directory_iterator::directory_iterator(fc::path const&) ()
#5  0x0000000004d89b1e in fc::file_appender::impl::delete_files() ()
#6  0x0000000004d8a0ac in fc::detail::void_functor_run<fc::file_appender::impl::delete_files()::{lambda()#1}>::run(void*, fc::detail::void_functor_run<fc::file_appender::impl::delete_files()::{lambda()#1}>) ()
#7  0x0000000004d22324 in fc::task_base::run_impl() ()
#8  0x0000000004d1fcff in fc::thread_d::process_tasks() ()
#9  0x0000000004d2045c in fc::thread_d::start_process_tasks(long) ()
#10 0x0000000005aacc21 in make_fcontext ()
#11 0x0000000000000000 in ?? ()

Thanks to @syalon for reporting.

Impacts
Describe which portion(s) of BitShares Core may be impacted by this bug. Please tick at least one box.

  • API (the application programming interface)
  • Build (the build process or something prior to compiled code)
  • CLI (the command line wallet)
  • Deployment (the deployment process after building such as Docker, Travis, etc.)
  • DEX (the Decentralized EXchange, market engine, etc.)
  • P2P (the peer-to-peer network for transaction/block propagation)
  • Performance (system or user efficiency, etc.)
  • Protocol (the blockchain logic, consensus, validation, etc.)
  • Security (the security of system or user data, etc.)
  • UX (the User Experience)
  • Other (please add below)

Steps To Reproduce
Restart witness_node, then wait.

Host Environment
Please provide details about the host environment. Much of this information can be found running: witness_node --version.

  • Host OS: Ubuntu 16.04 LTS
  • Host Physical RAM 16GB
  • BitShares Version: test-4.0.3
  • OpenSSL Version: 1.0.2g
  • Boost Version: 1.58

CORE TEAM TASK LIST

  • Evaluate / Prioritize Bug Report
  • Refine User Stories / Requirements
  • Define Test Cases
  • Design / Develop Solution
  • Perform QA/Testing
  • Update Documentation
@pmconrad
Copy link
Contributor

pmconrad commented Dec 9, 2019

Thread 18 looks like gdb has a problem with boost's context - the stack of that fc::thread ends at make_fcontext.

A possible cause in thread 1 is an apparent missing initialization in fc::file_appender::next_file_time. A consequence of this problem would be that the corresponding log file is never created. Can you confirm this? I don't see how the problem could resolve itself though, at least not that quickly.

Next time you see this, please use top -H (on Linux) to find out which thread is causing the CPU load. That would help identify the culprit.

@pmconrad
Copy link
Contributor

pmconrad commented Dec 9, 2019

A possible cause in thread 1 is an apparent missing initialization in fc::file_appender::next_file_time.

I think I was wrong, the fc::microseconds constructor has a default-0 argument.

@pmconrad
Copy link
Contributor

pmconrad commented Dec 9, 2019

Note that 100% (or more) CPU can be normal during syncing. In testnet in particular, during the load-test blocks, there is no visible syncing progress with high CPU load. At which block did that happen?

@abitmore
Copy link
Member Author

abitmore commented Dec 9, 2019

Next time you see this, please use top -H (on Linux) to find out which thread is causing the CPU load. That would help identify the culprit.

It's thread 1.

Note that 100% (or more) CPU can be normal during syncing. In testnet in particular, during the load-test blocks, there is no visible syncing progress with high CPU load. At which block did that happen?

It's a node which is in sync already. To reproduce the issue, restart a synced node, check after the time passed X:00:00.

@pmconrad
Copy link
Contributor

Thanks.

I think what happens is that delete_files runs after the rotation interval, before current_file_number has been incremented to the next file. This leads to delete_files being scheduled again, immediately. This loop persists until current_file_number is incremented due to some message being logged.

Reproducability depends on log settings.

@pmconrad
Copy link
Contributor

Was able to reproduce the problem with log-level warn and a short rotation interval. At the interval, CPU usage jumps up to 100% until the next message is logged, as expected.
Verified that the problem is fixed with bitshares/bitshares-fc#184 .

@pmconrad pmconrad added 2f Testing Status indicating the solution is being evaluated against the test case(s) 3d Bug Classification indicating the existing implementation does not match the intention of the design 6 Performance Impacts flag identifying system/user efficiency, performance, etc. 9a Tiny Effort estimation indicating TBD labels Dec 10, 2019
@pmconrad pmconrad self-assigned this Dec 10, 2019
@bobinson
Copy link

@pmconrad - what was the rotation intervals under which this was caused ?

@pmconrad
Copy link
Contributor

It's independent from the rotation interval, I just used a low interval for easier testing (didn't want to wait for an hour each time).
The behaviour can be seen easily if the default logger logs to a file_appender with level warn, i. e. it produces only few messages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2f Testing Status indicating the solution is being evaluated against the test case(s) 3d Bug Classification indicating the existing implementation does not match the intention of the design 6 Performance Impacts flag identifying system/user efficiency, performance, etc. 9a Tiny Effort estimation indicating TBD
Projects
None yet
Development

No branches or pull requests

3 participants