Flaky test: parallel/test-tls-ticket-cluster #29341

Open
mscdex opened this issue Aug 27, 2019 · 10 comments
Labels
  child_process: Issues and PRs related to the child_process subsystem.
  flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
  macos: Issues and PRs related to the macOS platform / OSX.
  tls: Issues and PRs related to the tls subsystem.

Comments

mscdex (Contributor) commented Aug 27, 2019

  • Version: master
  • Platform: test-macstadium-macos10.11-x64-1
  • Subsystem: test

Console output

04:34:35 not ok 2094 parallel/test-tls-ticket-cluster
04:34:35   ---
04:34:35   duration_ms: 0.579
04:34:35   severity: fail
04:34:35   exitcode: 1
04:34:35   stack: |-
04:34:35     [master] got "listening"
04:34:35     [master] connecting 60380 session? false
04:34:35     [worker] connection reused? false
04:34:35     [master] got "not-reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [master] got "listening"
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     events.js:186
04:34:35           throw er; // Unhandled 'error' event
04:34:35           ^
04:34:35     
04:34:35     Error: write EPIPE
04:34:35         at ChildProcess.target._send (internal/child_process.js:813:20)
04:34:35         at ChildProcess.target.send (internal/child_process.js:683:19)
04:34:35         at sendHelper (internal/cluster/utils.js:22:15)
04:34:35         at send (internal/cluster/master.js:351:10)
04:34:35         at internal/cluster/master.js:317:5
04:34:35         at done (internal/cluster/round_robin_handle.js:46:7)
04:34:35         at RoundRobinHandle.add (internal/cluster/round_robin_handle.js:55:12)
04:34:35         at queryServer (internal/cluster/master.js:311:10)
04:34:35         at Worker.onmessage (internal/cluster/master.js:246:5)
04:34:35         at ChildProcess.onInternalMessage (internal/cluster/utils.js:43:8)
04:34:35     Emitted 'error' event on Worker instance at:
04:34:35         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
04:34:35         at ChildProcess.emit (events.js:209:13)
04:34:35         at internal/child_process.js:817:39
04:34:35         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
04:34:35       errno: -32,
04:34:35       code: 'EPIPE',
04:34:35       syscall: 'write'
04:34:35     }
mscdex added the flaky-test and tls labels on Aug 27, 2019
mscdex (Contributor, Author) commented Aug 28, 2019

/cc @nodejs/cluster ?

Trott (Member) commented Aug 28, 2019

Also on FreeBSD:

https://ci.nodejs.org/job/node-test-commit-freebsd/28120/nodes=freebsd11-x64/console

00:15:03 not ok 1983 parallel/test-tls-ticket-cluster
00:15:03   ---
00:15:03   duration_ms: 0.865
00:15:03   severity: fail
00:15:03   exitcode: 1
00:15:03   stack: |-
00:15:03     [master] got "listening"
00:15:03     [master] connecting 28298 session? false
00:15:03     [worker] connection reused? false
00:15:03     [master] got "not-reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     events.js:186
00:15:03           throw er; // Unhandled 'error' event
00:15:03           ^
00:15:03     
00:15:03     Error: write EPIPE
00:15:03         at ChildProcess.target._send (internal/child_process.js:813:20)
00:15:03         at ChildProcess.target.send (internal/child_process.js:683:19)
00:15:03         at sendHelper (internal/cluster/utils.js:22:15)
00:15:03         at send (internal/cluster/master.js:351:10)
00:15:03         at internal/cluster/master.js:317:5
00:15:03         at Server.done (internal/cluster/round_robin_handle.js:46:7)
00:15:03         at Object.onceWrapper (events.js:298:28)
00:15:03         at Server.emit (events.js:214:15)
00:15:03         at emitListeningNT (net.js:1332:10)
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:79:21)
00:15:03     Emitted 'error' event on Worker instance at:
00:15:03         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
00:15:03         at ChildProcess.emit (events.js:209:13)
00:15:03         at internal/child_process.js:817:39
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
00:15:03       errno: -32,
00:15:03       code: 'EPIPE',
00:15:03       syscall: 'write'
00:15:03     }
00:15:03   ...

Trott (Member) commented Aug 28, 2019

I'm so far unable to replicate this locally, but EPIPE can happen if you try to write to a closed stream, and there have been some changes to streams lately... @nodejs/streams? @ronag?
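
A minimal sketch of that failure mode (hypothetical, not the test itself; the child program and the timing here are made up for illustration): writing to a pipe whose reading end has already gone away typically surfaces as EPIPE.

'use strict';
const { spawn } = require('child_process');

// Spawn a child that exits immediately without ever reading its stdin.
const child = spawn(process.execPath, ['-e', 'process.exit(0)']);

child.stdin.on('error', (err) => {
  // Depending on timing this is usually EPIPE, the same kind of race
  // that makes the failure above intermittent.
  console.log('write failed:', err.code);
});

child.on('exit', () => {
  // The child is gone, so the reading end of the stdin pipe is closed;
  // writing into it can now fail with EPIPE.
  child.stdin.write('hello\n');
});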

lpinca (Member) commented Aug 31, 2019

I can reproduce with

$ ./tools/test.py -J --repeat=500 test/parallel/test-tls-ticket-cluster.js

and bisecting points to 4367732 as the first bad commit.

@nodejs/libuv

lpinca added the child_process and macos labels on Aug 31, 2019
bnoordhuis (Member) commented Aug 31, 2019

That probably means libuv/libuv@ee24ce9 is responsible (cc @addaleax) but that also means it's probably exposing a latent bug in Node.js (edit: or the test), not introducing a new one.

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:

node/src/stream_wrap.cc, lines 350 to 354 at 98b7185:

err = uv_try_write(stream(), vbufs, vcount);
if (err == UV_ENOSYS || err == UV_EAGAIN)
  return 0;
if (err < 0)
  return err;

addaleax (Member) commented:

It looks more like an issue with the cluster/child process code to me. Currently, synchronous write errors from .send() are reported (but there is an ancient, undocumented option to turn that off, swallowErrors), while asynchronous write errors are always discarded silently.

It also looks like there's very little error handling in the cluster module in general; in particular, this bit also appears in the stack trace:

  // TODO(bnoordhuis) Check err.
  send(null, { sockname: out }, null);
} else {
  send(null, null, null); // UNIX socket.

Judging from the commit message of 0da4e0e (which added swallowErrors), I'd think it makes sense to add that flag for internal cluster messages, too.

/cc @cjihrig
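
For comparison, in user-facing code the documented way to keep a failed .send() from turning into an 'error' event is to pass a callback, which then receives the error instead. A rough sketch of that public behavior (the message contents here are made up; this is not the internal swallowErrors path):

'use strict';
const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();

  worker.on('exit', () => {
    // The worker is already gone. Without the callback, a failed send()
    // is emitted as an 'error' event on the Worker; with it, the error
    // (e.g. EPIPE or 'Channel closed') is delivered here instead.
    worker.send({ cmd: 'ping' }, (err) => {
      if (err)
        console.log('send failed:', err.code || err.message);
    });
  });

  worker.send('die');
} else {
  process.on('message', () => process.exit(0));
}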

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

bnoordhuis (Member) commented:

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

Yes, but before libuv/libuv@ee24ce9 that path was always taken. The if (err < 0) below was dead code but now it's not.

MylesBorins (Contributor) commented:

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?
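
For reference, marking it flaky would mean adding an entry to test/parallel/parallel.status under the affected platform's section, along these lines (illustrative only; the actual section would depend on which platforms are failing):

[$system==macos]
# https://github.com/nodejs/node/issues/29341
test-tls-ticket-cluster: PASS,FLAKY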

Trott (Member) commented Apr 4, 2020

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?

Using ncu-ci walk, I can only find one recent example, and it's a node-daily-v10.x-staging daily job.
