Flaky test: parallel/test-tls-ticket-cluster #29341

Open
mscdex opened this issue Aug 27, 2019 · 10 comments
Labels
  child_process: Issues and PRs related to the child_process subsystem.
  flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
  macos: Issues and PRs related to the macOS platform / OSX.
  tls: Issues and PRs related to the tls subsystem.

Comments

mscdex (Contributor) commented Aug 27, 2019

  • Version: master
  • Platform: test-macstadium-macos10.11-x64-1
  • Subsystem: test

Console output

04:34:35 not ok 2094 parallel/test-tls-ticket-cluster
04:34:35   ---
04:34:35   duration_ms: 0.579
04:34:35   severity: fail
04:34:35   exitcode: 1
04:34:35   stack: |-
04:34:35     [master] got "listening"
04:34:35     [master] connecting 60380 session? false
04:34:35     [worker] connection reused? false
04:34:35     [master] got "not-reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [master] got "listening"
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     events.js:186
04:34:35           throw er; // Unhandled 'error' event
04:34:35           ^
04:34:35     
04:34:35     Error: write EPIPE
04:34:35         at ChildProcess.target._send (internal/child_process.js:813:20)
04:34:35         at ChildProcess.target.send (internal/child_process.js:683:19)
04:34:35         at sendHelper (internal/cluster/utils.js:22:15)
04:34:35         at send (internal/cluster/master.js:351:10)
04:34:35         at internal/cluster/master.js:317:5
04:34:35         at done (internal/cluster/round_robin_handle.js:46:7)
04:34:35         at RoundRobinHandle.add (internal/cluster/round_robin_handle.js:55:12)
04:34:35         at queryServer (internal/cluster/master.js:311:10)
04:34:35         at Worker.onmessage (internal/cluster/master.js:246:5)
04:34:35         at ChildProcess.onInternalMessage (internal/cluster/utils.js:43:8)
04:34:35     Emitted 'error' event on Worker instance at:
04:34:35         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
04:34:35         at ChildProcess.emit (events.js:209:13)
04:34:35         at internal/child_process.js:817:39
04:34:35         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
04:34:35       errno: -32,
04:34:35       code: 'EPIPE',
04:34:35       syscall: 'write'
04:34:35     }
mscdex added the flaky-test and tls labels on Aug 27, 2019
mscdex (Contributor, Author) commented Aug 28, 2019

/cc @nodejs/cluster ?

Trott (Member) commented Aug 28, 2019

Also on FreeBSD:

https://ci.nodejs.org/job/node-test-commit-freebsd/28120/nodes=freebsd11-x64/console

00:15:03 not ok 1983 parallel/test-tls-ticket-cluster
00:15:03   ---
00:15:03   duration_ms: 0.865
00:15:03   severity: fail
00:15:03   exitcode: 1
00:15:03   stack: |-
00:15:03     [master] got "listening"
00:15:03     [master] connecting 28298 session? false
00:15:03     [worker] connection reused? false
00:15:03     [master] got "not-reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     events.js:186
00:15:03           throw er; // Unhandled 'error' event
00:15:03           ^
00:15:03     
00:15:03     Error: write EPIPE
00:15:03         at ChildProcess.target._send (internal/child_process.js:813:20)
00:15:03         at ChildProcess.target.send (internal/child_process.js:683:19)
00:15:03         at sendHelper (internal/cluster/utils.js:22:15)
00:15:03         at send (internal/cluster/master.js:351:10)
00:15:03         at internal/cluster/master.js:317:5
00:15:03         at Server.done (internal/cluster/round_robin_handle.js:46:7)
00:15:03         at Object.onceWrapper (events.js:298:28)
00:15:03         at Server.emit (events.js:214:15)
00:15:03         at emitListeningNT (net.js:1332:10)
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:79:21)
00:15:03     Emitted 'error' event on Worker instance at:
00:15:03         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
00:15:03         at ChildProcess.emit (events.js:209:13)
00:15:03         at internal/child_process.js:817:39
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
00:15:03       errno: -32,
00:15:03       code: 'EPIPE',
00:15:03       syscall: 'write'
00:15:03     }
00:15:03   ...

Trott (Member) commented Aug 28, 2019

I'm so far unable to replicate this locally, but EPIPE can happen if you try to write to a closed stream, and there have been some changes to streams lately... @nodejs/streams? @ronag?
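
A minimal sketch of that failure mode (hypothetical, not the test itself; the child program and the timing here are made up for illustration): writing to a pipe whose reading end has already gone away typically surfaces as EPIPE.

'use strict';
const { spawn } = require('child_process');

// Spawn a child that exits immediately without ever reading its stdin.
const child = spawn(process.execPath, ['-e', 'process.exit(0)']);

child.stdin.on('error', (err) => {
  // Depending on timing this is usually EPIPE, the same kind of race
  // that makes the failure above intermittent.
  console.log('write failed:', err.code);
});

child.on('exit', () => {
  // The child is gone, so the reading end of the stdin pipe is closed;
  // writing into it can now fail with EPIPE.
  child.stdin.write('hello\n');
});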

lpinca (Member) commented Aug 31, 2019

I can reproduce with

$ ./tools/test.py -J --repeat=500 test/parallel/test-tls-ticket-cluster.js

and bisecting points to 4367732 as the first bad commit.

@nodejs/libuv

lpinca added the child_process and macos labels on Aug 31, 2019
bnoordhuis (Member) commented Aug 31, 2019

That probably means libuv/libuv@ee24ce9 is responsible (cc @addaleax) but that also means it's probably exposing a latent bug in Node.js (edit: or the test), not introducing a new one.

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:

node/src/stream_wrap.cc, lines 350 to 354 at 98b7185:

err = uv_try_write(stream(), vbufs, vcount);
if (err == UV_ENOSYS || err == UV_EAGAIN)
  return 0;
if (err < 0)
  return err;

addaleax (Member) commented:

It looks more like an issue with the cluster/child process code to me. Currently, synchronous write errors from .send() are reported (but there is an ancient, undocumented option to turn that off, swallowErrors), while asynchronous write errors are always discarded silently.

It also looks like there's very little error handling in the cluster module in general; in particular, this bit also appears in the stack trace:

  // TODO(bnoordhuis) Check err.
  send(null, { sockname: out }, null);
} else {
  send(null, null, null); // UNIX socket.

Judging from the commit message of 0da4e0e (which added swallowErrors), I'd think it makes sense to add that flag for internal cluster messages, too.

/cc @cjihrig
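
For comparison, in user-facing code the documented way to keep a failed .send() from turning into an 'error' event is to pass a callback, which then receives the error instead. A rough sketch of that public behavior (the message contents here are made up; this is not the internal swallowErrors path):

'use strict';
const cluster = require('cluster');

if (cluster.isMaster) {
  const worker = cluster.fork();

  worker.on('exit', () => {
    // The worker is already gone. Without the callback, a failed send()
    // is emitted as an 'error' event on the Worker; with it, the error
    // (e.g. EPIPE or 'Channel closed') is delivered here instead.
    worker.send({ cmd: 'ping' }, (err) => {
      if (err)
        console.log('send failed:', err.code || err.message);
    });
  });

  worker.send('die');
} else {
  process.on('message', () => process.exit(0));
}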

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

bnoordhuis (Member) commented:

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

Yes, but before libuv/libuv@ee24ce9 that path was always taken. The if (err < 0) below was dead code but now it's not.

MylesBorins (Contributor) commented:

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?
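
For reference, marking it flaky would mean adding an entry to test/parallel/parallel.status under the affected platform's section, along these lines (illustrative only; the actual section would depend on which platforms are failing):

[$system==macos]
# https://github.com/nodejs/node/issues/29341
test-tls-ticket-cluster: PASS,FLAKY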

Trott (Member) commented Apr 4, 2020

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?

Using ncu-ci walk, I can only find one recent example, and it's a node-daily-v10.x-staging daily job.
