Stuck in 'connect' state with Cluster and Node 10.x #633

Closed
brettkiefer opened this issue May 30, 2018 · 14 comments

@brettkiefer

I'm seeing ioredis 3.2.2 intermittently get stuck indefinitely in the 'connect' state when run with Redis.Cluster under every version of Node 10 up to the latest (10.3.0), but not under previous major versions (e.g. 9.11.1). I'm working on a minimal repro to illustrate this, but right now it looks as though, under Node 10, the stream is not emitting a 'close' event to trigger eventHandler.closeHandler(_this).

I'm not yet sure why this happens, but I thought I'd give a heads-up as I look at it in case anyone else is seeing the same.
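
For context, the assumption at issue can be shown with plain net sockets, no ioredis involved: after a connection ends, Node should eventually emit 'close', and that is the event ioredis uses (via closeHandler) to clean up and reconnect. A minimal standalone sketch of that expectation:

'use strict';
// Standalone sketch (not ioredis code) of the behavior ioredis relies on:
// once a socket's connection is over, Node should emit 'end' and then
// 'close'; 'close' is the event that is missing in the stuck case above.
const net = require('net');

const server = net.createServer((conn) => conn.end());
server.listen(0, () => {
  const socket = net.connect(server.address().port);
  socket.on('end', () => console.log('end emitted'));
  socket.on('close', () => {
    console.log('close emitted');
    server.close();
  });
});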

@brettkiefer
Author

This is the most minimal repro I can get. Reproduced on Node 10.1.0 and 10.3.0 with ioredis 3.2.2 using the following script, which establishes 3 cluster connections, burns CPU for a few seconds, then sets a key on each cluster at intervals. It frequently stalls out under Node 10, but always succeeds under 9.11.1. I can't repro without a cluster.

'use strict';

const Redis = require('ioredis');

// Dial up the number of Redis clusters (nRedis) to increase the chances that a
// given run will hit the bug and get a stalled cluster
const nRedis = 3;
const redises = Array.from(Array(nRedis).keys()).map(i => ({
  n: i,
  redis: new Redis.Cluster([
    { host: 'redis1', port: 6379 },
    { host: 'redis2', port: 6379 },
    { host: 'redis3', port: 6379 },
    { host: 'redis4', port: 6379 },
    { host: 'redis5', port: 6379 },
    { host: 'redis6', port: 6379 },
  ]),
}));

// Just tie up the CPU for about 3 seconds to foul up communication with the
// clusters
console.log('Start burning CPU');
const startBurn = new Date();
const randoms = [];
for (let i = 0; i < 5000000; i++) {
  randoms[i] = Math.random();
}
randoms.sort((a, b) => a - b);
const burned = new Date() - startBurn;
console.log(`Stop burning CPU, burned ${burned}ms`);

redises.map(({ n, redis }) => {
  let active = false;
  let successes = 0;
  setInterval(() => {
    if (active) {
      console.log(`### STUCK: CLUSTER (${n}) STALLED OUT ###`);
      process.exit(1);
    }
    active = Date.now();
    console.log(`Set redis key on cluster ${n}`);
    redis
      .set('test.set.key', Date.now())
      .then(() => {
        active = false;
        if (++successes === 3) {
          console.log(`<<< SUCCESS, WE DID NOT STALL OUT >>>`);
          process.exit(0);
        }
      })
      .done();
  }, 1000);
});

@brettkiefer
Author

It's slightly harder to get it to stall out with a single connection (dialing nRedis down to 1), but after a few runs of that on Node 10.3.0 with ioredis debug logging enabled, I get:

Thu, 31 May 2018 20:46:52 GMT ioredis:cluster status: [empty] -> connecting
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis1:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis2:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis3:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis4:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis5:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis6:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:52 GMT ioredis:cluster getting slot cache from redis5:6379
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis5:6379]: wait -> connecting
Thu, 31 May 2018 20:46:52 GMT ioredis:redis queue command[0] -> cluster(slots)
Thu, 31 May 2018 20:46:52 GMT ioredis:redis status[redis2:6379]: wait -> connecting
Start burning CPU
Stop burning CPU, burned 3693ms
Thu, 31 May 2018 20:46:56 GMT ioredis:cluster getting slot cache from redis2:6379
Thu, 31 May 2018 20:46:56 GMT ioredis:redis queue command[0] -> cluster(slots)
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:46:56 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:46:56 GMT ioredis:redis queue command[0] -> info()
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Thu, 31 May 2018 20:46:56 GMT ioredis:connection send 1 commands in offline queue
Thu, 31 May 2018 20:46:56 GMT ioredis:redis write command[0] -> cluster(slots)
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis1:6379]: wait -> close
Thu, 31 May 2018 20:46:56 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis1:6379]: close -> end
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis3:6379]: wait -> close
Thu, 31 May 2018 20:46:56 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis3:6379]: close -> end
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis4:6379]: wait -> close
Thu, 31 May 2018 20:46:56 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis4:6379]: close -> end
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis6:6379]: wait -> close
Thu, 31 May 2018 20:46:56 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[redis6:6379]: close -> end
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:46:56 GMT ioredis:cluster status: connecting -> connect
Thu, 31 May 2018 20:46:56 GMT ioredis:redis queue command[0] -> cluster(info)
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: ready -> close
Thu, 31 May 2018 20:46:56 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: close -> end
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: wait -> connecting
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:46:56 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:46:56 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Set redis key on cluster 0
### STUCK: CLUSTER (0) STALLED OUT ###

@brettkiefer
Author

On a healthy run (this is with Node 10.3.0 as well, but just on a run where we don't see the intermittent failure), we see the cluster get into the ready state instead of stalling out in 'connect':

Thu, 31 May 2018 20:56:09 GMT ioredis:cluster status: [empty] -> connecting
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis1:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis2:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis3:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis4:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis5:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis6:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:09 GMT ioredis:cluster getting slot cache from redis6:6379
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis6:6379]: wait -> connecting
Thu, 31 May 2018 20:56:09 GMT ioredis:redis queue command[0] -> cluster(slots)
Thu, 31 May 2018 20:56:09 GMT ioredis:redis status[redis2:6379]: wait -> connecting
Start burning CPU
Stop burning CPU, burned 3359ms
Thu, 31 May 2018 20:56:12 GMT ioredis:cluster getting slot cache from redis1:6379
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis1:6379]: wait -> connecting
Thu, 31 May 2018 20:56:12 GMT ioredis:redis queue command[0] -> cluster(slots)
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis queue command[0] -> info()
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Thu, 31 May 2018 20:56:12 GMT ioredis:connection send 1 commands in offline queue
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> cluster(slots)
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis3:6379]: wait -> close
Thu, 31 May 2018 20:56:12 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis3:6379]: close -> end
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis4:6379]: wait -> close
Thu, 31 May 2018 20:56:12 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis4:6379]: close -> end
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis5:6379]: wait -> close
Thu, 31 May 2018 20:56:12 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[redis5:6379]: close -> end
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: [empty] -> wait
Thu, 31 May 2018 20:56:12 GMT ioredis:cluster status: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: ready -> close
Thu, 31 May 2018 20:56:12 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: close -> end
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: ready -> close
Thu, 31 May 2018 20:56:12 GMT ioredis:connection skip reconnecting since the connection is manually closed.
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: close -> end
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: wait -> connecting
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Thu, 31 May 2018 20:56:12 GMT ioredis:cluster getting slot cache from xx.xxx.x.xx:6379
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> cluster(slots)
Thu, 31 May 2018 20:56:12 GMT ioredis:delayqueue send 1 commands in failover queue
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: wait -> connecting
Thu, 31 May 2018 20:56:12 GMT ioredis:redis queue command[0] -> cluster(info)
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connecting -> connect
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> info()
Thu, 31 May 2018 20:56:12 GMT ioredis:redis status[xx.xxx.x.xx:6379]: connect -> ready
Thu, 31 May 2018 20:56:12 GMT ioredis:connection send 1 commands in offline queue
Thu, 31 May 2018 20:56:12 GMT ioredis:redis write command[0] -> cluster(info)
Thu, 31 May 2018 20:56:12 GMT ioredis:cluster status: connect -> ready
Set redis key on cluster 0
Thu, 31 May 2018 20:56:13 GMT ioredis:redis write command[0] -> set(test.set.key,1527800173790)
Set redis key on cluster 0
Thu, 31 May 2018 20:56:14 GMT ioredis:redis write command[0] -> set(test.set.key,1527800174791)
Set redis key on cluster 0
Thu, 31 May 2018 20:56:15 GMT ioredis:redis write command[0] -> set(test.set.key,1527800175792)
<<< SUCCESS, WE DID NOT STALL OUT >>>

@brettkiefer
Author

This is the most minimal simulation I have been able to produce of the good and bad cases we're seeing in our actual application in staging on Node 10 with ioredis 3.2.2 and Redis.Cluster. This may boil down to a Node bug or some kind of improper usage on our end, but the starting point seems to be understanding why the cluster gets stuck in 'connect' even though the cluster(info) command goes through and the underlying redis connection reaches 'ready', and that's where I'm having some trouble.

@brettkiefer
Author

brettkiefer commented Jun 7, 2018

Just an update: today I reproduced the issue with https://hub.docker.com/r/grokzen/redis-cluster/ and Node 10.4.0. I copied the script over with

docker cp ./redisTest.js redis-cluster_1:/data/

then from within the container did

wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"
nvm install 9.11.1
nvm use 9.11.1
npm install ioredis
for i in {1..10}; do DEBUG=ioredis.* node ./redisTest.js; if [ $? -ne 0 ]; then break; fi; done

which all succeeded as expected, then did

nvm install 10.4.0
for i in {1..10}; do DEBUG=ioredis.* node ./redisTest.js; if [ $? -ne 0 ]; then break; fi; done

which broke on the first run with the expected pattern.

That's with the connection config changed for the Docker cluster, so redisTest.js is:

'use strict';

const Redis = require('ioredis');

// Dial up the number of Redis clusters (nRedis) to increase the chances that a
// given run will hit the bug and get a stalled cluster
const nRedis = 3;
const redises = Array.from(Array(nRedis).keys()).map(i => ({
  n: i,
  redis: new Redis.Cluster([
    { host: 'localhost', port: 7000 },
    { host: 'localhost', port: 7001 },
    { host: 'localhost', port: 7002 },
    { host: 'localhost', port: 7003 },
    { host: 'localhost', port: 7004 },
    { host: 'localhost', port: 7005 },
  ]),
}));

// Just tie up the CPU for about 3 seconds to foul up communication with the
// clusters
console.log('Start burning CPU');
const startBurn = new Date();
const randoms = [];
for (let i = 0; i < 5000000; i++) {
  randoms[i] = Math.random();
}
randoms.sort((a, b) => a - b);
const burned = new Date() - startBurn;
console.log(`Stop burning CPU, burned ${burned}ms`);

redises.map(({ n, redis }) => {
  let active = false;
  let successes = 0;
  setInterval(() => {
    if (active) {
      console.log(`### STUCK: CLUSTER (${n}) STALLED OUT ###`);
      process.exit(1);
    }
    active = Date.now();
    console.log(`Set redis key on cluster ${n}`);
    redis
      .set('test.set.key', Date.now())
      .then(() => {
        active = false;
        if (++successes === 3) {
          console.log(`<<< SUCCESS, WE DID NOT STALL OUT >>>`);
          process.exit(0);
        }
      })
      .done();
  }, 1000);
});

@brettkiefer
Author

brettkiefer commented Jun 8, 2018

Unsurprisingly (it was the most suspicious-looking change in the changelog for this), the problem traces back to nodejs/node@9b7a6914a7. If you build Node at the previous revision (nodejs/node@74553465e6), this test does not repro; at 9b7a6914a7f0bd754e78b42b48c75851cfd6b3c4 it does.

It looks like there is some hubbub around 'close' events in the subsequent Node 10 changelogs, but it concerns http, e.g. nodejs/node#20941.

@jcstanaway

jcstanaway commented Jun 8, 2018

So you're suspecting that this is really a Node issue and not an ioredis issue? It would probably be good to log an issue there as well and try to get some of those folks to re-review that commit. This issue is going to prevent folks using ioredis from upgrading to Node 10.

Though perhaps there was some breaking change that ioredis needs to handle.

@brettkiefer
Author

brettkiefer commented Jun 8, 2018

Yes, I suspect that Node isn't emitting 'close' under certain circumstances and that this is a bug introduced by that commit, but it will probably take me a little while to boil it down to something that takes ioredis out of the picture.

For the purposes of this test, we can currently work around it by handling 'end' rather than 'close':

diff --git a/lib/redis.js b/lib/redis.js
index 8870bec..39b9b53 100644
--- a/lib/redis.js
+++ b/lib/redis.js
@@ -279,7 +279,7 @@ Redis.prototype.connect = function (callback) {

       stream.once(CONNECT_EVENT, eventHandler.connectHandler(_this));
       stream.once('error', eventHandler.errorHandler(_this));
-      stream.once('close', eventHandler.closeHandler(_this));
+      stream.once('end', eventHandler.closeHandler(_this));
       stream.on('data', eventHandler.dataHandler(_this));

... but that seems very wrong; handling 'close' seems more correct to me, and it looks as though that event should always be emitted.
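
Another stopgap, at the application level rather than in ioredis, would be to watch for a cluster that never reaches 'ready' and recreate it. A rough sketch, not a tested workaround; it only assumes the public status property and the 'ready' event on Redis.Cluster:

'use strict';
// Hypothetical application-level watchdog (not an ioredis feature): if the
// cluster has not reached 'ready' within a deadline, assume it is stuck in
// 'connect', disconnect it, and let the caller build a fresh instance.
function watchCluster(createCluster, onStuck, timeoutMs = 10000) {
  const cluster = createCluster();
  const timer = setTimeout(() => {
    if (cluster.status !== 'ready') {
      cluster.disconnect();
      onStuck(cluster);
    }
  }, timeoutMs);
  cluster.once('ready', () => clearTimeout(timer));
  return cluster;
}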

@brettkiefer
Author

brettkiefer commented Jun 8, 2018

Ah, OK: if the socket is paused there may not be a 'close', which is why the change is semver-major:

nodejs/node#19241 (comment)
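
My reading of that, as a standalone sketch (not verified against every 10.x build): while the socket is paused the EOF is never consumed, so 'end', and the 'close' that follows it, are deferred until the stream is resumed.

'use strict';
// Sketch of the paused-socket case from the linked comment (my reading of
// it, not a verified repro): pausing the socket keeps the EOF from being
// consumed, so 'end' and the 'close' that follows it are deferred until
// the stream is resumed.
const net = require('net');

const server = net.createServer((conn) => conn.end());
server.listen(0, () => {
  const socket = net.connect(server.address().port);
  socket.pause(); // stop reading, so the EOF is not consumed
  socket.on('close', () => {
    console.log('close emitted');
    server.close();
  });
  setTimeout(() => {
    console.log('no close yet; resuming the socket');
    socket.resume(); // consuming the EOF lets 'end' and 'close' fire
  }, 2000);
});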

@brettkiefer
Author

... but at the point where we get the EOF (nodejs/node@9b7a6914a7#diff-e6ef024c3775d787c38487a6309e491dL645), the connection's isPaused() returns false, so it's not that.

The issue is that we get into onReadableStreamEnd with the socket's this.writable set to true, so the socket's maybeDestroy doesn't call socket.destroy and the socket never emits 'close'.

From the PR and the commit it's a little hard to tell whether that's intentional, but it's looking more like it isn't.
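
To make that concrete, the check in question is roughly this (a paraphrase of the logic as I read the diff, not the verbatim Node source):

// Rough paraphrase (not the verbatim Node source) of the maybeDestroy logic
// described above: the socket is only destroyed, and 'close' only emitted,
// once it is neither readable nor writable. In the stuck case the EOF
// handler runs while this.writable is still true, so destroy() never
// happens and 'close' never fires.
function maybeDestroy(socket) {
  if (!socket.readable && !socket.writable && !socket.destroyed) {
    socket.destroy(); // this is what ultimately emits 'close'
  }
}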

@brettkiefer
Author

There is more detail on what appears to be happening with the socket at nodejs/node#19241 (comment). The long and short of it is that if we hit the connect timeout and call .end() before the socket is connected, the socket gets into a state in which it never calls .destroy() and thus never emits 'close'. For the time being it might be reasonable to work around this in the connection timeout logic or by handling 'end' events, while we work this out with an issue against Node.js (if they agree that the behavior is incorrect).
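
Here is a Node-only sketch of that sequence with ioredis taken out of it; it follows the description above (end the socket before 'connect' has been emitted, with the event loop blocked like the CPU burn in the repro script), but I haven't confirmed it stalls on its own, so treat it as an illustration rather than a verified repro:

'use strict';
// Sketch of the suspected sequence, without ioredis: start a TCP connect,
// call .end() before Node has emitted 'connect' (the event loop is blocked,
// like the CPU burn in the repro script above), then watch whether the
// client socket ever emits 'close'.
const net = require('net');

const server = net.createServer((conn) => conn.end());
server.listen(0, () => {
  const socket = net.connect(server.address().port);
  socket.on('error', (err) => console.log(`error: ${err.message}`));
  socket.on('close', () => {
    console.log('close emitted');
    process.exit(0);
  });

  // End the socket before it is connected, as the connect-timeout path does.
  socket.end();

  // Block the event loop so the TCP handshake completes at the OS level
  // before Node gets a chance to emit 'connect'.
  const start = Date.now();
  while (Date.now() - start < 3000) {}

  setTimeout(() => {
    console.log("no 'close' event after 5s, which matches the stuck case");
    process.exit(1);
  }, 5000);
});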

@brettkiefer
Author

This will be fixed when nodejs/node#21290 ships in Node 10. I'll leave this issue open until then in case anyone else runs into it.

@jcstanaway

Great investigation work!

@brettkiefer
Author

@ccs018 Thanks! The fix is out with Node 10.5.0, so I'll close this.
