
CI failures with no output/stack trace #19903

Closed
Trott opened this issue Apr 9, 2018 · 13 comments
Labels
build: Issues and PRs related to build files or the CI.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
libuv: Issues and PRs related to the libuv dependency or the uv binding.
test: Issues and PRs related to the tests.
tools: Issues and PRs related to the tools directory.

Comments

Trott (Member) commented Apr 9, 2018

  • Version: 10.0.0-pre
  • Platform: various Linux platforms
  • Subsystem: test

We've been seeing these for quite some time. They look like this:

not ok 123 parallel/test-some-test
  ---
  duration_ms: 0.456
  severity: fail
  stack: |-

There are no contents for stdout or stderr and no stack trace, yet the test failed. Huh? Wha?

Might be a problem in tools/test.py.

@BridgeAR and @addaleax have looked to varying degrees at this issue but haven't discovered anything as far as I know.

@refack and I have a PR open to add the exit code to the output when these failures happen, in case that helps: #19855

@nodejs/testing @nodejs/build

Trott added the build, test, tools, and flaky-test labels Apr 9, 2018
rvagg (Member) commented Apr 10, 2018

Last round of investigation happened in #18998

addaleax (Member) commented

Two things I found out, which don’t give a lot of insight but might narrow things down a bit:

  • This appears to happen more frequently when tests are not being run in parallel
  • Output that is inserted through process._rawDebug at the beginning of the script is shown by the test runner

bnoordhuis (Member) commented

Does dmesg show anything afterwards?

addaleax (Member) commented

@bnoordhuis Unfortunately, no. :/ There are only a handful of lines reported after boot on the debian-x86 machine: a few from invalid incoming external UDP packets, and 7 segfault notices for node; but those don’t match the times (or the frequency) with which I could reproduce the issue on that machine.

addaleax added the libuv label Apr 10, 2018
addaleax (Member) commented

@bnoordhuis Okay, this turns out to be a pretty deep rabbit hole.

The reported exit code according to #19855 is -6, which would be SIGABRT. That would explain why we don’t see any output: HasCrashed() in tools/test.py explicitly excludes SIGABRT:

node/tools/test.py
Lines 582 to 583 in 5e68172

    return self.output.exit_code < 0 and \
           self.output.exit_code != -signal.SIGABRT

Also, I found out there’s a magical place where all the core dumps go on debian machines with systemd – so we actually have those available, too, we just didn’t know that.

This is the stack trace I’m getting from the most recent core dump that I could produce:

#0  0xb775bd40 in __kernel_vsyscall ()
#1  0xb7453367 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#2  0xb7454a23 in __GI_abort () at abort.c:89
#3  0x0857e86c in uv_sem_post (sem=0xbf83b360) at ../deps/uv/src/unix/thread.c:520
#4  0x0856edff in worker (arg=0xbf83b360) at ../deps/uv/src/threadpool.c:55
#5  0xb75d8ecb in start_thread (arg=0xb50ffb40) at pthread_create.c:309
#6  0xb7510d0e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:129
(main thread stack)
#0  0x0850cda5 in node::TCPWrap::Connect(v8::FunctionCallbackInfo<v8::Value> const&) ()
#1  0x0866afc9 in v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) ()
#2  0x086de729 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#3  0x086deed1 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#4  0x3fd061be in ?? ()
#5  0x3fd11fd8 in ?? ()
#6  0x3fd0c11d in ?? ()
#7  0x3fd11fd8 in ?? ()
#8  0x3fd11fd8 in ?? ()
#9  0x3fd0c11d in ?? ()
#10 0x3fd11fd8 in ?? ()
#11 0x3fd11fd8 in ?? ()
#12 0x3fd0fe04 in ?? ()
#13 0x3fd09b51 in ?? ()
#14 0x089c589d in v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) ()
#15 0x08658bf6 in v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) ()
#16 0x08475359 in node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) ()
#17 0x0844b4d8 in node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) ()
#18 0x08450522 in node::cares_wrap::(anonymous namespace)::AfterGetAddrInfo(uv_getaddrinfo_s*, int, addrinfo*) ()
#19 0x085772e0 in uv__getaddrinfo_done (w=0xb342e94, status=0) at ../deps/uv/src/unix/getaddrinfo.c:134
#20 0x0856f23c in uv__work_done (handle=0x9ba2da0 <default_loop_struct+96>) at ../deps/uv/src/threadpool.c:251
#21 0x08571210 in uv__async_io (loop=0x9ba2d40 <default_loop_struct>, w=0x9ba2e30 <default_loop_struct+240>, events=1) at ../deps/uv/src/unix/async.c:118
#22 0x08581ff8 in uv__io_poll (loop=0x9ba2d40 <default_loop_struct>, timeout=-1) at ../deps/uv/src/unix/linux-core.c:400
#23 0x08571b59 in uv_run (loop=0x9ba2d40 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:368
#24 0x08480456 in node::Start(uv_loop_s*, int, char const* const*, int, char const* const*) ()
#25 0x0847f1da in node::Start(int, char**) ()
#26 0x084416da in main ()

(Also, in test-tls-buffersize, this always happens after the listen callback is finished, but before the socket is registered as connected. I assume that it’s related to moving the DNS lookup to the thread pool.)

The errno for the failing sem_post() call is EINVAL.

It seems this was introduced by libuv/libuv@647fbc0 – we picked this up in #18260 on Jan 24 on master, which roughly matches my feeling of when these things started happening. Also, reverting it seems to resolve the issue. (I got 7/20000 failures before, and now it’s down to 0/20000.)

I don’t see anything wrong with that commit. However, more debugging and digging through the glibc source code and its git history brought up https://sourceware.org/bugzilla/show_bug.cgi?id=12674, which seems to be exactly the bug we’re experiencing. (Here’s a pure-C reproduction that is similar to the libuv code and is fixed by bminor/glibc@042e152.)

The fix is included in glibc >= 2.21, but I’m not sure what that means for us going forward.
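
For illustration, a minimal C sketch of that race pattern (a hypothetical stand-in, not the reproduction linked above), modelled on the threadpool startup visible in the stack trace: a worker thread posts a semaphore that the initializing thread waits on and then destroys. On an affected glibc (< 2.21), sem_post() can still touch the semaphore after sem_wait() has returned in the other thread, so it can sporadically fail with EINVAL once the semaphore is destroyed. Compile with cc -pthread; failures should only occur on the buggy glibc versions, and only rarely.

/* Hypothetical reproduction sketch of glibc bug 12674: sem_post() in one
 * thread racing against sem_wait() + sem_destroy() in another, mirroring
 * libuv's threadpool startup (worker() posting the init semaphore). */
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

static void* worker(void* arg) {
  sem_t* sem = arg;
  if (sem_post(sem) != 0) {   /* the call that aborts in the crash above */
    perror("sem_post");       /* EINVAL on affected glibc versions */
    abort();
  }
  return NULL;
}

int main(void) {
  int i;
  for (i = 0; i < 20000; i++) {
    sem_t sem;
    pthread_t t;

    if (sem_init(&sem, 0, 0) != 0)
      abort();
    if (pthread_create(&t, NULL, worker, &sem) != 0)
      abort();

    /* Returns as soon as the worker increments the semaphore... */
    while (sem_wait(&sem) != 0 && errno == EINTR)
      ;  /* retry if interrupted by a signal */

    /* ...but on buggy glibc the worker's sem_post() may not be done
     * touching `sem` yet, so destroying it here triggers the race. */
    sem_destroy(&sem);
    pthread_join(t, NULL);
  }
  return 0;
}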

rvagg (Member) commented Apr 11, 2018

nice work!

For the record, the cores collected by systemd are only there because I had LimitCORE=unlimited turned on and an upgraded systemd installed (via jessie-backports) on the debian 8 machines, because @joyeecheung requested core dumps. I've since removed LimitCORE (yesterday, via nodejs/build#1224), but there hasn't been a restart, so they will keep collecting cores until that happens. If someone needs cores again in the future, we'll have to turn them back on. The problem @joyeecheung ran into, though, was that the matching binaries don't persist, so unless you're capturing the cores on the machine yourself, having them generated via our CI runs is not as useful.

addaleax added a commit to addaleax/libuv that referenced this issue Apr 12, 2018
Hack around https://sourceware.org/bugzilla/show_bug.cgi?id=12674
by providing a custom implementation for glibc < 2.21 in terms of other
concurrency primitives.

The glibc implementation on these versions is inherently unsafe.
So, while libuv and Node.js support those versions, it seems to make
sense for libuv in its functionality as a platform abstraction library
to provide a working version.

Fixes: nodejs/node#19903
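
As a rough illustration of the approach the commit message describes ("a custom implementation ... in terms of other concurrency primitives"), here is a sketch of a semaphore built from a pthread mutex and condition variable. This is not the actual libuv code, just the general shape of such a fallback; error checking is omitted for brevity.

/* Sketch of a semaphore built on a mutex and a condition variable, the kind
 * of fallback the commit message describes for glibc < 2.21. Not the actual
 * libuv implementation. */
#include <pthread.h>

typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  unsigned int value;
} custom_sem_t;

void custom_sem_init(custom_sem_t* sem, unsigned int value) {
  pthread_mutex_init(&sem->mutex, NULL);
  pthread_cond_init(&sem->cond, NULL);
  sem->value = value;
}

void custom_sem_post(custom_sem_t* sem) {
  pthread_mutex_lock(&sem->mutex);
  sem->value++;
  pthread_cond_signal(&sem->cond);
  pthread_mutex_unlock(&sem->mutex);
}

void custom_sem_wait(custom_sem_t* sem) {
  pthread_mutex_lock(&sem->mutex);
  while (sem->value == 0)
    pthread_cond_wait(&sem->cond, &sem->mutex);
  sem->value--;
  pthread_mutex_unlock(&sem->mutex);
}

void custom_sem_destroy(custom_sem_t* sem) {
  pthread_cond_destroy(&sem->cond);
  pthread_mutex_destroy(&sem->mutex);
}

The intent is that custom_sem_post() never touches the semaphore after releasing the mutex, and custom_sem_wait() cannot return until that release has happened, so the wait-then-destroy pattern from the threadpool startup becomes safe.
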
joyeecheung (Member) commented Apr 12, 2018

I think the problem that @joyeecheung ran into though was that the matching binaries don't persist so unless you're capturing the cores on the machine yourself, having them generated via our CI runs is not as useful.

We could work around that by writing a custom script that saves the binary somewhere and using it in /proc/sys/kernel/core_pattern, although that could be pretty involved operation-wise (a rough sketch follows below). When #19661 lands we might also turn on node-report in tests like this to get the JS stack and native stack, among other useful information.
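
For the record, a hedged sketch of what such a core_pattern helper could look like: the kernel pipes the core dump to the handler's stdin and keeps the crashing process around long enough that /proc/<pid>/exe can still be copied, so the handler can stash the core and a matching binary side by side. The handler path and the /var/cores directory below are made up for illustration, and error handling is minimal (short writes are not retried).

/* Hypothetical core_pattern pipe handler that saves both the core dump and
 * a copy of the crashing binary, so they stay matched. Assumes something like:
 *   echo '|/usr/local/bin/core-handler %p %e' > /proc/sys/kernel/core_pattern
 * where %p is the PID and %e the executable name (see core(5)). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int copy_fd(int in, int out) {
  char buf[65536];
  ssize_t n;
  while ((n = read(in, buf, sizeof(buf))) > 0)
    if (write(out, buf, (size_t) n) != n)   /* short writes not handled */
      return -1;
  return n == 0 ? 0 : -1;
}

static int copy_file(const char* src, const char* dst) {
  int rc = -1;
  int in = open(src, O_RDONLY);
  int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (in >= 0 && out >= 0)
    rc = copy_fd(in, out);
  if (in >= 0) close(in);
  if (out >= 0) close(out);
  return rc;
}

int main(int argc, char** argv) {
  char exe_link[64], core_path[256], exe_path[256];
  int out;

  if (argc < 3)
    return 1;

  /* /var/cores is a made-up dump directory; it must exist and be writable. */
  snprintf(core_path, sizeof(core_path), "/var/cores/%s.%s.core", argv[2], argv[1]);
  snprintf(exe_path, sizeof(exe_path), "/var/cores/%s.%s.bin", argv[2], argv[1]);
  snprintf(exe_link, sizeof(exe_link), "/proc/%s/exe", argv[1]);

  /* The core dump arrives on stdin; write it out first. */
  out = open(core_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (out < 0 || copy_fd(STDIN_FILENO, out) != 0)
    return 1;
  close(out);

  /* Then copy the binary that produced it via the still-live /proc entry. */
  return copy_file(exe_link, exe_path) == 0 ? 0 : 1;
}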

addaleax added a commit to addaleax/node that referenced this issue Apr 12, 2018
Otherwise, potentially no output is shown for aborts.

Refs: nodejs#19903
gireeshpunathil (Member) commented

@addaleax - I was playing around with your C reproduction, but it always succeeded. Is there any precondition for it to fail? My system's glibc is 2.11, and it has an ample number of CPUs.

addaleax (Member) commented

@gireeshpunathil The glibc implementation is hardware-dependent, so I guess that might be an issue? Also, weirdly enough, this problem only seemed to reproduce when the machine was not under load…

gireeshpunathil (Member) commented

@addaleax - thanks, that is a great clue. Less load means the two threads stand a good chance of running on two real CPUs in a truly parallel manner, increasing the chance of the racy interleaving. On the other hand, if the two sem_* threads share a CPU, the scheduler will only switch between them at meaningful edges such as system call boundaries, which shields the problem.

BridgeAR pushed a commit to BridgeAR/node that referenced this issue Apr 13, 2018
Otherwise, potentially no output is shown for aborts.

PR-URL: nodejs#19990
Refs: nodejs#19903
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
jasnell pushed a commit that referenced this issue Apr 16, 2018
Otherwise, potentially no output is shown for aborts.

PR-URL: #19990
Refs: #19903
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
addaleax (Member) commented

Re-opening until the libuv update lands on master

addaleax reopened this Apr 17, 2018
cjihrig (Contributor) commented Apr 17, 2018

@bnoordhuis @santigimeno @addaleax do you think this is a big enough issue that it's worth getting a libuv release out today?

addaleax (Member) commented

@cjihrig I think we can guess the impact of this from looking at the frequency on the CI machine: on Node.js v8.x and v9.x, when used on an OS that ships those old glibc versions, every 3000th Node.js process or so crashes once it tries to do something interesting while the machine isn’t under load.

It’s not the end of the world, but I think it would be good if we could get a fix out in Node.js this month or so, in particular because this has also ended up in an LTS branch.

jasnell pushed a commit that referenced this issue Apr 23, 2018
PR-URL: #20129
Fixes: #20112
Fixes: #19903
Reviewed-By: Myles Borins <myles.borins@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
BridgeAR pushed a commit to BridgeAR/node that referenced this issue May 1, 2018
Otherwise, potentially no output is shown for aborts.

PR-URL: nodejs#19990
Refs: nodejs#19903
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
addaleax added a commit that referenced this issue Jun 29, 2018
Otherwise, potentially no output is shown for aborts.

PR-URL: #19990
Refs: #19903
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
MylesBorins pushed a commit that referenced this issue Aug 7, 2018
PR-URL: #20129
Fixes: #20112
Fixes: #19903
Reviewed-By: Myles Borins <myles.borins@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
rvagg pushed a commit that referenced this issue Aug 16, 2018
Otherwise, potentially no output is shown for aborts.

PR-URL: #19990
Refs: #19903
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
MylesBorins pushed a commit to MylesBorins/node that referenced this issue Nov 5, 2018
PR-URL: nodejs#20129
Fixes: nodejs#20112
Fixes: nodejs#19903
Reviewed-By: Myles Borins <myles.borins@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
MylesBorins pushed a commit that referenced this issue Nov 11, 2018
Backport-PR-URL: #24103
PR-URL: #20129
Fixes: #20112
Fixes: #19903
Reviewed-By: Myles Borins <myles.borins@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>