
node process blocked #7180

Closed
cynron opened this issue Feb 24, 2014 · 24 comments

Comments

@cynron

cynron commented Feb 24, 2014

stack backtrace

(gdb) info threads
  Id   Target Id         Frame
  3    Thread 0x7fcfcaf61700 (LWP 31786) "SignalSender" 0x00007fcfc9f97420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7fcfc8ff1700 (LWP 31792) "node" 0x00007fcfc9cdb99d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
* 1    Thread 0x7fcfcaf63720 (LWP 31784) "node" 0x00007fcfc9f91e75 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fcfcaf63720 (LWP 31784))]
#0  0x00007fcfc9f91e75 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fcfc9f91e75 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00000000009c1fee in uv_thread_join (tid=<optimized out>) at ../deps/uv/src/unix/thread.c:37
#2  0x00000000009c26f2 in cleanup () at ../deps/uv/src/unix/threadpool.c:143
#3  0x00000000005af11a in __do_global_dtors_aux ()
#4  0x00000000009c73a1 in _fini ()
#5  0x00007fff7511d6a0 in ?? ()
#6  0x00007fcfcad5f4ec in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fcfc9c36df2 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fcfc9c36e45 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000874f14 in node::FatalException(v8::TryCatch&) ()
#10 0x00000000008754d0 in node::MakeCallback(v8::Handle<v8::Object>, v8::Handle<v8::Function>, int, v8::Handle<v8::Value>*) ()
#11 0x00000000008755b9 in node::MakeCallback(v8::Handle<v8::Object>, v8::Handle<v8::String>, int, v8::Handle<v8::Value>*) ()
#12 0x0000000000890b43 in node::StreamWrap::AfterWrite(uv_write_s*, int) ()
#13 0x00000000009c0c9c in uv__write_callbacks (stream=<optimized out>) at ../deps/uv/src/unix/stream.c:886
#14 uv__stream_io (loop=<optimized out>, w=0x11cda40, events=<optimized out>) at ../deps/uv/src/unix/stream.c:1124
#15 0x00000000009b93af in uv__run_pending (loop=<optimized out>) at ../deps/uv/src/unix/core.c:580
#16 uv_run (loop=0xec2b40, mode=<optimized out>) at ../deps/uv/src/unix/core.c:311
#17 0x0000000000876562 in node::Start(int, char**) ()
#18 0x00007fcfc9c1eead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x00000000005af0a9 in _start ()

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fcfc8ff1700 (LWP 31792))]
#0  0x00007fcfc9cdb99d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fcfc9cdb99d in recvmsg () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcfc9cfc19c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcfc9cfc61a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcfc9cb90c7 in getaddrinfo () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00000000009bc692 in uv__getaddrinfo_work (w=<optimized out>) at ../deps/uv/src/unix/getaddrinfo.c:33
#5  0x00000000009c2620 in worker (arg=<optimized out>) at ../deps/uv/src/unix/threadpool.c:74
#6  0x00000000009b7de1 in uv__thread_start (ctx_v=<optimized out>) at ../deps/uv/src/uv-common.c:322
#7  0x00007fcfc9f90b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007fcfc9cdaa7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000000000 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fcfcaf61700 (LWP 31786))]
#0  0x00007fcfc9f97420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fcfc9f97420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000086ad15 in v8::internal::LinuxSemaphore::Wait() ()
#2  0x00000000007700c7 in v8::internal::RuntimeProfiler::WaitForSomeIsolateToEnterJS() ()
#3  0x000000000086c0d8 in v8::internal::SignalSender::Run() ()
#4  0x000000000086af58 in v8::internal::ThreadEntry(void*) ()
#5  0x00007fcfc9f90b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fcfc9cdaa7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x0000000000000000 in ?? ()

@cynron
Author

cynron commented Feb 24, 2014

Can anybody help me?

@indutny
Member

indutny commented Feb 24, 2014

What OS are you using? It looks like getaddrinfo() never returns, which points at a libc or kernel problem rather than a libuv bug.
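
For reference, a minimal sketch of how JS code ends up in getaddrinfo() on the thread pool (the hostname is just a placeholder):

var dns = require('dns');

// dns.lookup() (and anything that resolves a host name, e.g. net.connect()
// called with a host name) is serviced by libuv's thread pool, which calls
// the blocking libc getaddrinfo(); that is the uv__getaddrinfo_work frame
// in thread 2 above.
dns.lookup('example.com', function (err, address) {
  // if getaddrinfo() never returns, this callback never fires and the
  // worker thread stays blocked
  console.log(err || address);
});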

@indutny
Member

indutny commented Feb 24, 2014

Also, does it hang indefinitely or exit after a couple of minutes?

@halfblood369

Thanks for your reply!
Debian GNU/Linux 6.0.4 (squeeze)
It hangs indefinitely, and there are lots of TCP connections with data pending in the Recv-Q.

@indutny
Member

indutny commented Feb 24, 2014

Maybe it is worth trying to update to 6.0.9?

@halfblood369

OK, I will give it a try.

@cynron
Author

cynron commented Feb 27, 2014

#4  0x00000000009c73a1 in _fini ()
#5  0x00007fff7511d6a0 in ?? ()
#6  0x00007fcfcad5f4ec in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fcfc9c36df2 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007fcfc9c36e45 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000874f14 in node::FatalException(v8::TryCatch&) ()

We have installed an uncaughtException handler, as follows:

process.on('uncaughtException', function(e) {
   // log the error and carry on
   // note: we do not call process.exit() here
});

But stack frame #8 shows exit() being invoked, which I think should not happen.
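
If I understand node's behaviour correctly, exit() can still be reached even with a handler installed, for example when the handler itself throws. A minimal sketch of that case (brokenLogger is a hypothetical bug, just for illustration):

process.on('uncaughtException', function (e) {
  // a bug inside the handler itself makes the second exception fatal,
  // so the process still exits despite the handler being installed
  brokenLogger(e.message);  // hypothetical, not defined anywhere: throws ReferenceError
});

setTimeout(function () {
  throw new Error('boom');  // triggers the handler above
}, 10);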

This issue has occurred a few times, not frequently.

Another issue we have seen is that the node process disappears silently without leaving anything behind, such as a crash log. We would find that the service had stopped, and when we checked, the server process was gone. This problem has also occurred a few times, not frequently.

Updating the OS may not be worthwhile, because the issues occurred not only on Debian 6.0.4 but also on Ubuntu 12.04.

The issues are hard to reproduce.

Thanks.

@indutny
Member

indutny commented Feb 27, 2014

@cynron is it a separate issue? The first one you reported had getaddrinfo() hanging.

@cynron
Author

cynron commented Feb 27, 2014

Thank you for the reply.

Sorry, I did not state it clearly. What I meant is that the exit() call in stack frame #8 is abnormal, and it may be the origin of the other problems.

@indutny
Member

indutny commented Mar 4, 2014

Could you please provide any test case for one or both of the above problems?

@obastemur

@cynron what is the node version exactly?

@cynron
Author

cynron commented Mar 7, 2014

Thanks

@indutny Sorry, it is hard to reproduce, so we cannot extract a test case from our code.

@obastemur node v0.10.25

@indutny
Member

indutny commented Mar 7, 2014

ok, closing :)

@indutny indutny closed this as completed Mar 7, 2014
@rprieto

rprieto commented Aug 14, 2014

Sorry to bump up a closed issue, but we're seeing a similar problem, with similar gdb output.

#0  0x000000391020d720 in sem_wait () from /lib64/libpthread.so.0
#1  0x0000003ad2786f2d in v8::internal::LinuxSemaphore::Wait() () from /usr/lib64/libv8.so.3
#2  0x0000003ad279eaa0 in v8::internal::RuntimeProfiler::WaitForSomeIsolateToEnterJS() () from /usr/lib64/libv8.so.3
#3  0x0000003ad2788046 in v8::internal::SignalSender::Run() () from /usr/lib64/libv8.so.3
#4  0x0000003ad2787390 in ?? () from /usr/lib64/libv8.so.3
#5  0x0000003910207851 in start_thread () from /lib64/libpthread.so.0
#6  0x000000390fae894d in clone () from /lib64/libc.so.6

This happens systematically, every morning, and the symptoms are:

  • the process still accepts inbound HTTP requests, but doesn't respond to them (hence we're seeing ETIMEDOUT, not ECONNREFUSED)
  • it doesn't make any outbound HTTP requests anymore (it typically would, a few times a second)
  • process.on('SIGINT') and process.on('SIGTERM') don't seem to trigger anymore (they should be logging something, but don't; see the sketch below)
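
The handlers are roughly like this (simplified sketch):

process.on('SIGINT', function () {
  console.log('SIGINT received');   // never shows up while the process is hung
});

process.on('SIGTERM', function () {
  console.log('SIGTERM received');  // never shows up while the process is hung
});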

This is node v0.10.22 on Red Hat Enterprise Linux Server release 6.4 (Santiago).
How would you go about troubleshooting it?

strace -fp <pid> shows that all the threads seem to be stuck on a few FUTEX_WAIT_PRIVATE calls and EAGAIN (resource temporarily unavailable).

@tjfontaine

There are lots of changes between 0.10.22 and 0.10.30 that may be related; can you please try a newer node?

@rprieto

rprieto commented Aug 15, 2014

Thanks, I'll upgrade Node on our server and see what happens tomorrow morning.

@indutny
Member

indutny commented Aug 15, 2014

@rprieto the backtrace that you have provided is unrelated to this anyway; it looks like a perfectly normal state for v8's profiler thread.

@rprieto

rprieto commented Aug 15, 2014

Thanks, that's good to hear. What other tools can you recommend to understand what's happening?

@cynron
Author

cynron commented Aug 21, 2014

@indutny after we upgraded the kernel, this issue has not appeared again.

@indutny
Member

indutny commented Aug 22, 2014

Good news!

@ramesh-rayaprolu

Can you please let us know which kernel you are using? We are on CentOS 7
(kernel 3.10.0-123.el7.x86_64), and the node version is v0.10.32.
I think we are seeing the same issue...

@saibv

saibv commented Oct 26, 2015

Adding to what @ramesh-rayaprolu said, the node process is stuck at:

(gdb) bt
#0  0x00007fc02278611d in recvmsg () from /lib64/libc.so.6
#1  0x00007fc0227a675d in __netlink_request () from /lib64/libc.so.6
#2  0x00007fc0227a6a2f in getifaddrs_internal () from /lib64/libc.so.6
#3  0x00007fc0227a7710 in getifaddrs () from /lib64/libc.so.6
#4  0x00007fc0242847ca in uv_interface_addresses () from /lib64/libuv.so.0.10
#5  0x0000000000439612 in node::GetInterfaceAddresses(v8::Arguments const&) ()
#6  0x00003adaa5c4517b in ?? ()
#7  0x00007fff1b6c7c88 in ?? ()
#8  0x00007fff1b6c7c88 in ?? ()
#9  0x0000000000000000 in ?? ()

Also, we are running CentOS 7 with vmxnet3.
There are several node processes running; some hang indefinitely while others are unaffected.
There are TCP connections in CLOSE_WAIT with data pending in the receive queue.

@bnoordhuis
Member

@saibv That looks like a libc or kernel bug to me. You can maybe work around it by removing any os.networkInterfaces() calls you have in your code.
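
For example, if the calls are on a hot path, caching the result once at startup may be enough (a rough sketch):

var os = require('os');

// call getifaddrs() (via os.networkInterfaces()) once at startup and reuse
// the result instead of querying it per request; if even the startup call
// can hang, the addresses could be read from configuration instead
var cachedInterfaces = os.networkInterfaces();

function getInterfaces() {
  return cachedInterfaces;
}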

I'm curious, do the stuck processes consume CPU or are they completely idle? What do you see when you attach strace to a stuck process?

@saibv

saibv commented Oct 28, 2015

Thanks @bnoordhuis, we will work around it by avoiding the os.networkInterfaces() calls.

I guess it might be a kernel issue; I saw a couple of threads reporting this issue on Fedora, and upgrading the kernel helped them (https://bugzilla.redhat.com/show_bug.cgi?id=1232347).

Interestingly, the CPU usage of the processes is mostly idle but jumps to 5% to 15% or so about once a minute. strace shows them just stuck in recvmsg.
