investigate flaky test: test-cluster-net-listen-ipv6only-rr #25813

Closed
gireeshpunathil opened this issue Jan 30, 2019 · 4 comments · Fixed by #26298
Labels
cluster (Issues and PRs related to the cluster subsystem.) · flaky-test (Issues and PRs related to the tests with unstable failures on the CI.) · net (Issues and PRs related to the net subsystem.) · test (Issues and PRs related to the tests.)

Comments

gireeshpunathil (Member) commented Jan 30, 2019

  • Version: master
  • Platform: linux
  • Subsystem: net, cluster

https://ci.nodejs.org/job/node-test-commit-linux/nodes=ubuntu1604-64/24988/consoleFull

09:43:31 not ok 333 parallel/test-cluster-net-listen-ipv6only-rr
09:43:31   ---
09:43:31   duration_ms: 0.512
09:43:31   severity: fail
09:43:31   exitcode: 1
09:43:31   stack: |-
09:43:31     events.js:173
09:43:31           throw er; // Unhandled 'error' event
09:43:31           ^
09:43:31     
09:43:31     Error: listen EADDRINUSE: address already in use 0.0.0.0:42338
09:43:31         at Server.setupListenHandle [as _listen2] (net.js:1232:14)
09:43:31         at listenInCluster (net.js:1280:12)
09:43:31         at doListen (net.js:1419:7)
09:43:31         at processTicksAndRejections (internal/process/next_tick.js:76:17)
09:43:31     Emitted 'error' event at:
09:43:31         at emitErrorNT (net.js:1259:8)
09:43:31         at processTicksAndRejections (internal/process/next_tick.js:76:17)
09:43:31   ...
gireeshpunathil added the flaky-test label Jan 30, 2019
Trott (Member) commented Feb 16, 2019

https://ci.nodejs.org/job/node-test-commit-linux/25373/nodes=centos6-64-gcc6/consoleFull

test-softlayer-centos6-x64-1

00:17:45 not ok 332 parallel/test-cluster-net-listen-ipv6only-rr
00:17:45   ---
00:17:45   duration_ms: 0.754
00:17:45   severity: fail
00:17:45   exitcode: 1
00:17:45   stack: |-
00:17:45     events.js:173
00:17:45           throw er; // Unhandled 'error' event
00:17:45           ^
00:17:45     
00:17:45     Error: listen EADDRINUSE: address already in use 0.0.0.0:38498
00:17:45         at Server.setupListenHandle [as _listen2] (net.js:1232:14)
00:17:45         at listenInCluster (net.js:1280:12)
00:17:45         at doListen (net.js:1419:7)
00:17:45         at processTicksAndRejections (internal/process/next_tick.js:76:17)
00:17:45     Emitted 'error' event at:
00:17:45         at emitErrorNT (net.js:1259:8)
00:17:45         at processTicksAndRejections (internal/process/next_tick.js:76:17)
00:17:45   ...

Trott (Member) commented Feb 24, 2019

https://ci.nodejs.org/job/node-test-commit-linuxone/10613/nodes=rhel72-s390x/testReport/(root)/test/parallel_test_cluster_net_listen_ipv6only_rr/

test-linuxonecc-rhel72-s390x-1

events.js:173
      throw er; // Unhandled 'error' event
      ^

Error: listen EADDRINUSE: address already in use 0.0.0.0:46458
    at Server.setupListenHandle [as _listen2] (net.js:1232:14)
    at listenInCluster (net.js:1280:12)
    at doListen (net.js:1419:7)
    at processTicksAndRejections (internal/process/next_tick.js:76:17)
Emitted 'error' event at:
    at emitErrorNT (net.js:1259:8)
    at processTicksAndRejections (internal/process/next_tick.js:76:17)

Trott added the cluster, net, and test labels Feb 24, 2019
gireeshpunathil (Member Author) commented

By design, the test makes a server listen 4 times:

  • thrice through the spawned cluster workers, and
  • once through the master, when the countdown latch fires (condensed in the sketch below)
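A hypothetical condensation of that shape (not the actual file, which uses the test suite's Countdown helper and a worker map, but the same four listens):

const cluster = require('cluster');
const net = require('net');
cluster.schedulingPolicy = cluster.SCHED_RR;

if (cluster.isMaster) {
  let address;
  let pending = 3;  // stands in for the Countdown latch
  for (let i = 0; i < 3; i++) {
    cluster.fork().on('listening', (addr) => {
      address = addr;
      if (--pending === 0) {
        // 4th listen: bind the workers' port on the IPv4 wildcard to
        // verify that `ipv6Only` left the IPv4 side free
        net.createServer().listen({ host: '0.0.0.0', port: address.port });
      }
    });
  }
} else {
  // listens 1-3: under SCHED_RR the master binds on the workers' behalf
  net.createServer().listen({ host: '::', port: 0, ipv6Only: true });
}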

The first suspect, of course, is some entity listening twice on the same port:

  • the cluster workers doing so is not an issue and is valid - under round-robin scheduling they won't bind the port themselves, but depend on the master;
  • the master doing so is an error.

So the next suspect is whether the countdown latch is broken and was somehow re-entered.

This patch disproves that theory: with the change applied, the assertion never fired.

--- a/test/parallel/test-cluster-net-listen-ipv6only-rr.js
+++ b/test/parallel/test-cluster-net-listen-ipv6only-rr.js
@@ -15,11 +15,13 @@ cluster.schedulingPolicy = cluster.SCHED_RR;
 const host = '::';
 const WORKER_ACCOUNT = 3;
 
+let reenter = 0;
 if (cluster.isMaster) {
   const workers = new Map();
   let address;
 
   const countdown = new Countdown(WORKER_ACCOUNT, () => {
+    assert.strictEqual(++reenter, 1);
     // Make sure the `ipv6Only` option works.
     const server = net.createServer().listen({
       host: '0.0.0.0',

The next suspect is this sequence:

  • the cluster workers start listening first, and populate the address field;
  • as the port supplied was 0, they catch whatever port happens to be available;
  • in particular, they may hang on to a port that is already in use (listening) by some other program on the IPv4 side, since the workers bind only the IPv6 side;
  • after 3 iterations, the master picks up that port and tries to bind it on 0.0.0.0, leading to the error.

Does this make sense? (The repro in the next comment confirms it.)

gireeshpunathil (Member Author) commented

Looks like the above theory is confirmed. The script below occupies 3000 ephemeral ports on 0.0.0.0 in the master, then forks a worker that repeatedly listens on [::]:0 with `ipv6Only`; sooner or later the worker is handed a port whose IPv4 side the master already holds:
$ cat foo.js

const cluster = require('cluster');
const net = require('net');
cluster.schedulingPolicy = cluster.SCHED_RR;
const host = '::';

const buf = [];  // ports the master holds on the IPv4 side
let j = 0;
if (cluster.isMaster) {
  // Occupy 3000 ephemeral ports on 0.0.0.0.
  for (let i = 0; i < 3000; i++) {
    net.createServer(() => {}).listen({ host: '0.0.0.0', port: 0 }, function() {
      buf.push(this.address().port);
      if (++j === 3000) {
        // Let one worker listen repeatedly on [::]:0 with ipv6Only until
        // the kernel hands it a port whose IPv4 side we already hold.
        cluster.fork().on('listening', (addr) => {
          if (buf.includes(addr.port)) {
            console.log(`got it at ${addr.port}!`);
            process.exit(0);
          }
        });
      }
    });
  }
} else {
  setInterval(() => {
    net.createServer(() => {}).listen({ host, port: 0, ipv6Only: true });
  }, 100);
}

$ ./node foo
got it at 43684!

So this means that a cluster worker asking for port 0 can be handed a port that some other process is already listening on: with `ipv6Only` (which sets the IPV6_V6ONLY socket option), binding [::]:P does not claim 0.0.0.0:P, so the kernel is free to assign an ephemeral port whose IPv4 side is taken. In a parallel test execution environment, this clearly leads to collisions.
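The dual-stack half of this can be seen in isolation - a minimal sketch (not from this thread): bind [::] with ipv6Only, then bind the same port on 0.0.0.0; the second bind succeeds because the IPv4 side was never claimed:

const net = require('net');

const v6 = net.createServer().listen(
  { host: '::', port: 0, ipv6Only: true }, () => {
    const { port } = v6.address();
    // Succeeds only because IPV6_V6ONLY left the IPv4 side unclaimed.
    const v4 = net.createServer().listen({ host: '0.0.0.0', port }, () => {
      console.log(`both [::]:${port} and 0.0.0.0:${port} are bound`);
      v4.close(() => v6.close());
    });
  });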

gireeshpunathil added a commit to gireeshpunathil/node that referenced this issue Feb 28, 2019
In test-cluster-net-listen-ipv6only-rr, the cluster member that
listens on `any` port (port 0) has the potential to grab any port
from the environment, which, when passed on to the master, causes a
collision when the master tries to listen on it.

Moving the test to sequential alone is not sufficient, as the cluster
member can in theory still land on one of the admin ports on the host.

Assigning a static port alone is also not sufficient, as it can
interfere with other tests in the parallel category, most of which
run in `port: any` fashion.

So move the test to sequential, and use a static port.

Fixes: nodejs#25813
PR-URL: nodejs#26298
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Beth Griggs <Bethany.Griggs@uk.ibm.com>
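For reference, the gist of the landed fix, sketched as a diff (the exact hunk in #26298 may differ; common.PORT is the static port Node's test common module reserves for tests in the sequential category):

--- a/test/parallel/test-cluster-net-listen-ipv6only-rr.js
+++ b/test/sequential/test-cluster-net-listen-ipv6only-rr.js
 } else {
   net.createServer().listen({
     host,
-    port: 0,
+    port: common.PORT,
     ipv6Only: true,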
addaleax pushed a commit that referenced this issue Mar 1, 2019 (the same commit message as above; Fixes: #25813, PR-URL: #26298)