server threads die with assertion error #6222

Closed
mxk1235 opened this issue Sep 13, 2013 · 38 comments

@mxk1235

mxk1235 commented Sep 13, 2013

Using node 0.10.18 on Ubuntu 12.04 LTS.

../deps/uv/src/unix/stream.c:494: uv__server_io: Assertion `events == 1' failed

This happens several times a second, effectively killing the threads. I found an old thread about this assertion that was closed for lack of a repro.

Thanks in advance.

@mxk1235
Author

mxk1235 commented Sep 13, 2013

To give more context, this is very closely related to this discussion: joyent/libuv#826

We are using express 3.1.0 and cluster, which forks up to 7-10 servers that listen on the same port.

I would love to get a fix, but the discussion also mentions a workaround with a 5ms delay. Any idea if that would work?
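
A minimal sketch of that kind of setup, for reference (the port, route, and worker count here are hypothetical, not the reporter's actual code):

var cluster = require('cluster');
var express = require('express');

if (cluster.isMaster) {
  // fork several workers; in node 0.10 the master owns the listen
  // socket and hands it to each worker over the IPC channel
  for (var i = 0; i < 8; i++) cluster.fork();
} else {
  var app = express();
  app.get('/', function (req, res) { res.send('ok'); });
  app.listen(3000); // every worker shares the same port
}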

@bnoordhuis
Member

Can you try this?

$ strace -o trace.log -fe \!mmap,munmap,mprotect,brk,gettimeofday,futex node script.js

Please post the trace.log here.

@mxk1235
Author

mxk1235 commented Sep 13, 2013

FWIW, I used this script http://sprunge.us/FPjU with strace. The output is quite large with our code because the socket close event does not occur for ~60 minutes.

https://s3.amazonaws.com/pierrerize-public/trace.log

@bnoordhuis
Member

Thanks. Can you post the output of uname -a on the affected system? From reading the trace, it looks like the process that aborts:

  1. receives a file descriptor (socket handle) from another process
  2. adds it to the epoll set with mode=EPOLLIN
  3. then calls epoll_wait() which returns mode=EPOLLOUT

It seems the file descriptor is supposed to be a listen socket but isn't. A test case would help because I can't exclude the possibility yet that there is a bug in your code somewhere.
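
For context, the handle passing in step 1 is what the cluster module does internally when a worker listens. A minimal standalone sketch of the same mechanism (hypothetical port, not taken from the trace):

var net = require('net');
var fork = require('child_process').fork;

if (!process.send) {
  // parent: create the listen socket, then send it to the child
  var server = net.createServer();
  server.listen(3000, function () {
    fork(__filename).send('server', server);
  });
} else {
  // child: the received handle's fd is added to this process's
  // epoll set; libuv expects only EPOLLIN (incoming connections)
  process.on('message', function (msg, handle) {
    if (msg !== 'server') return;
    handle.on('connection', function (c) { c.end('hi\n'); });
  });
}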

@mxk1235
Author

mxk1235 commented Sep 20, 2013

mike@mike-test-1:~$ uname -a
Linux mike-test-1 3.2.0-51-virtual #77-Ubuntu SMP Wed Jul 24 20:38:32 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

@bnoordhuis
Member

Thanks. I received a couple of other reports and I'm reasonably sure by now that it's caused by stale events for a file descriptor that's been replaced by a file descriptor from another process. No ETA for a fix yet.

@ghost ghost assigned bnoordhuis Sep 20, 2013
@mxk1235
Author

mxk1235 commented Sep 20, 2013

Thanks for looking at it. From what I understand, it looks connected to the way cluster listens on sockets. Does it make sense to add a delay when restarting the thread as a workaround? Or perhaps there is an alternative to cluster that we can use.

@bnoordhuis
Member

If by 'thread' you mean 'worker process', then there's no real way to work around it currently (apart from not using the cluster module, of course.)

@mxk1235
Author

mxk1235 commented Sep 20, 2013

yes. thank you!

@mxk1235
Author

mxk1235 commented Sep 30, 2013

Hi, I saw that this was added to the 0.10 milestone. Any further ETA info?

@bnoordhuis
Member

Not really, I'm afraid. I'm on leave currently, I only work on node sporadically. Unless someone else picks this up (unlikely), it'll have to wait until I get back and have time to look into it.

@laino

laino commented Oct 11, 2013

I'm running into this too. For some weird reason it only happens on one of two identically configured servers (even after re-installing node and npm), and I can't get it to work on the first one.

Is there any easy (or dirty) fix for this issue that doesn't require dropping the cluster module?
It's quite annoying, as node crashes immediately after starting for me.

Thanks.

@anacronw

I am also running into this issue. Just adding to the list if it helps :)

general stack is: cluster + express + redis

uname -a output (we are using CentOS):
2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The symptom is 100% CPU utilization on the master. Running strace shows a bunch of file descriptor errors.

In my case nothing actually crashes and the application runs OK; it's just noticeable when using top.

Thanks for confirming the issue @bnoordhuis, I can turn off clustering in my case.

@laino

laino commented Oct 19, 2013

For me it was fixed by listening on the ports immediately instead of waiting for a redis connection etc. Quite ugly, as the server can't really process requests the moment it comes online now.
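
Roughly, the change amounts to moving listen() out of the redis ready callback; a sketch with hypothetical names:

var express = require('express');
var redis = require('redis');

var app = express();
var client = redis.createClient();

// before: app.listen(3000) lived inside this callback, so the
// listen socket was only set up once redis came online
client.on('ready', function () {
  app.locals.redisReady = true;
});

// after: bind the port immediately; requests that arrive before
// redis is ready have to be rejected or queued by the app
app.listen(3000);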

@vpArth

vpArth commented Oct 22, 2013

This issue repeats for me when nodemon restarts the application on file changes.
The application is a simple cluster + restify worker setup.

@indutny
Member

indutny commented Oct 25, 2013

Should be fixed by joyent/libuv#971

@mxk1235
Author

mxk1235 commented Oct 25, 2013

awesome! will try.

@vpArth

vpArth commented Oct 28, 2013

Good.
Incidentally, this can be worked around by forking a new worker on the exit event.
Implemented as the restart option in node-cluster-app.
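
A minimal sketch of that restart-on-exit pattern (the worker entry point is hypothetical):

var cluster = require('cluster');

if (cluster.isMaster) {
  cluster.fork();
  // respawn a replacement whenever a worker dies
  cluster.on('exit', function (worker, code, signal) {
    console.log('worker ' + worker.process.pid + ' died; restarting');
    cluster.fork();
  });
} else {
  require('./worker'); // hypothetical worker module
}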

@trevnorris

@mxk1235 can you confirm that the issue has been fixed?

@indutny
Member

indutny commented Nov 12, 2013

Fixed in joyent/libuv@bbccafb and joyent/libuv@f50ccd5. Should get into the next v0.10 and v0.11 release.

@indutny indutny closed this as completed Nov 12, 2013
@mmastrac

I'm still seeing this with 0.10.22 with our setup. What we're seeing is that when we run grunt as a task in Gradle, node dies with an assertion error when trying to launch a process in Karma:

grunt: ../deps/uv/src/unix/stream.c:494: uv__server_io: Assertion `events == 1' failed.

Oddly enough, it only happens if grunt runs a shell command before Karma attempts to run the browser, and only if grunt is run under Gradle.

I've run strace over it -- it looks like epoll_ctl starts throwing these errors soon after the shell command finishes (search for rm -Rf):

29687 <... epoll_ctl resumed> ) = -1 EBADF (Bad file descriptor)

https://www.dropbox.com/s/geiat4cwb91sqt7/trace.log.gz

When run outside of Gradle (where it works), you can see epoll continuing to run after the rm -Rf succeeds:

https://www.dropbox.com/s/jq7osljb8x5j0kd/trace2.log.gz

@mmastrac

This may be related to #6271, as the strace log was showing the same EPOLL_CTL_DEL behaviour.

@indutny indutny reopened this Nov 20, 2013
@indutny
Member

indutny commented Nov 20, 2013

Reopening, will look into it soon.

@marckysharky

Also have this issue when running Grunt via a Jenkins Job.

I have uploaded a truncated trace, captured with the following strace command:

strace -o trace.log -fe \!mmap,munmap,mprotect,brk,gettimeofday,futex grunt

https://s3.amazonaws.com/grunt.errors/trace.log

System:

uname -a
Linux {truncated} 3.2.0-52-virtual #78-Ubuntu SMP Fri Jul 26 16:45:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

node --version
v0.10.22

npm --version
1.3.14

Node is installed via the chris-lea PPA (http://ppa.launchpad.net/chris-lea/node.js/ubuntu/pool/main/n/nodejs/)

@darthtrevino

I've been seeing this issue in the last couple of weeks when using our Jenkins instance on Cloudbees:

uname -a
Linux localhost 3.7.10-101.fc17.x86_64 #1 SMP Wed Feb 27 19:14:22 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

@dmitry-vsl

I also have this issue when running grunt-contrib-connect from jenkins.

@apaleslimghost

Thanks @mmastrac. I'd been trying to diagnose this for a day or two; removing a shell command fixes it.

@crashbell

I got the same issue with Node v0.10.20

uname -a
Linux Witurn 3.2.0-27-generic #43-Ubuntu SMP Fri Jul 6 14:25:57 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

node --version
v0.10.20

npm --version
1.3.10

Should I upgrade my Node version?

@indutny
Member

indutny commented Jun 12, 2014

Yes, I would suggest updating node first.

@tsukasagenesis

Same problem here:
Linux 73118 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
node v0.10.29 from ppa.launchpad.net_chris-lea_node.js_ubuntu_dists_trusty_main_binary-amd64_Package

It seems to reuse a stale file descriptor; see the strace output:

read(8, "\1\0\0\0\0\0\0\0", 1024)       = 8
epoll_wait(5, {{EPOLLIN, {u32=25, u64=25}}}, 1024, 88) = 1
read(25, "", 65536)                     = 0
futex(0x7fbe8c0008c8, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_ctl(5, EPOLL_CTL_DEL, 25, {EPOLLRDNORM|EPOLLWRNORM|EPOLLHUP|0x1b3d820, {u32=0, u64=122681170965037056}}) = 0
close(25)                               = 0
futex(0xe13d84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xe13d80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0xe13d40, FUTEX_WAKE_PRIVATE, 1)  = 1
write(1, "fd: 24\n", 7)                 = 7
write(1, "11byebye\n", 9)               = 9
epoll_wait(5, {{EPOLLIN, {u32=11, u64=11}}, {EPOLLIN, {u32=8, u64=8}}}, 1024, 76) = 2
accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 24
accept4(11, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
read(8, "\1\0\0\0\0\0\0\0", 1024)       = 8
write(1, "11close2\n", 9)               = 9
futex(0xe13d84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xe13d80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0xe13d40, FUTEX_WAKE_PRIVATE, 1)  = 1
epoll_ctl(5, EPOLL_CTL_ADD, 24, {EPOLLIN, {u32=24, u64=24}}) = -1 EBADF (Bad file descriptor)
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(18944, 18944, SIGABRT)           = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=18944, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++

@ljwagerfield

I also receive the following error when running grunt connect on Ubuntu 14.04 (x64) with node v0.10.25, but only when running from crontab. The same user is invoking the process in both crontab (which fails) and the manual execution (which succeeds).

grunt: ../deps/uv/src/unix/stream.c:494: uv__server_io: Assertion `events == 1' failed.

@chicagobuss

For people having problems running these commands from jenkins or cron, try adding the --norc flag to your bash script - it worked for us, and we were seeing the exact same behavior (worked when run manually, failed under cron and jenkins).

@jasnell
Member

jasnell commented Jun 3, 2015

@indutny ... was there a resolution on this one?

@indutny
Member

indutny commented Jun 3, 2015

I suppose it is resolved now; I haven't seen any reports apart from this one.

@indutny
Member

indutny commented Jun 3, 2015

But no formal resolution either.

@jasnell
Member

jasnell commented Jun 3, 2015

Ok, closing then. Can reopen if new information is received.

@jasnell jasnell closed this as completed Jun 3, 2015
@bostrom

bostrom commented Sep 29, 2015

Any news on this one?

I'm getting this error when jenkins executes grunt karma on a Ubuntu slave. Adding #!/bin/bash --norc or just bash --norc to the beginning of my "Execute shell" build step doesn't work.

I've tried searching the net for answers, but I got nothing.

Slave:

uname -a
Linux jenkins-slave 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

node --version
v0.10.25

npm --version
1.3.10

@tsukasagenesis

Just never use file descriptors directly; they're not correctly supported, and if you close one, node can still try to use it and crash. That's all from me.
