-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SentinelConnector set KnownNodes empty during redis failover, which causes app could not access redis any more #3017
Comments
Here is the modification to reproduce this issue. Stack trace[INFO] Running io.lettuce.core.masterreplica.MyTest
2024-10-18 17:11:17 [ERROR] [main] TEST: getNodes called count 1 (MasterReplicaTopologyRefresh:60)
2024-10-18 17:11:18 [ERROR] [lettuce-epollEventLoop-4-1] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65)
2024-10-18 17:11:19 [ERROR] [lettuce-epollEventLoop-4-2] TEST: nodeList after ping [RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA], RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM]] (MasterReplicaTopologyRefresh:85)
2024-10-18 17:11:19 [ERROR] [main] TEST: +sdown 1 (MyTest:47)
2024-10-18 17:11:24 [ERROR] [lettuce-eventExecutorLoop-1-4] TEST: getNodes called count 2 (MasterReplicaTopologyRefresh:60)
2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-6] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65)
2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: inject requestPing failing (MasterReplicaTopologyRefresh:75)
2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: Requests node redis://localhost:6482?timeout=10s set future to null (Requests:63)
2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: Requests node redis://localhost:6483?timeout=10s set future to null (Requests:63)
2024-10-18 17:11:24 [ERROR] [lettuce-epollEventLoop-4-8] TEST: nodeList after ping [] (MasterReplicaTopologyRefresh:85)
2024-10-18 17:11:24 [WARN ] [lettuce-epollEventLoop-4-8] Topology refresh returned no nodes from redis-sentinel://127.0.0.1,127.0.0.1:26380?sentinelMasterId=mymaster&timeout=10s (SentinelConnector:126)
2024-10-18 17:11:29 [ERROR] [main] TEST: +sdown 2 (MyTest:50)
2024-10-18 17:11:30 [ERROR] [main] TEST: +sdown 3 (MyTest:53)
2024-10-18 17:11:34 [ERROR] [lettuce-eventExecutorLoop-1-1] TEST: getNodes called count 3 (MasterReplicaTopologyRefresh:60)
2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-9] TEST: refresh getNodes initialNodes [RedisMasterReplicaNode [redisURI=redis://localhost:6482?timeout=10s, role=UPSTREAM], RedisMasterReplicaNode [redisURI=redis://localhost:6483?timeout=10s, role=REPLICA]] (MasterReplicaTopologyRefresh:65)
2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: inject requestPing failing (MasterReplicaTopologyRefresh:75)
2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: Requests node redis://localhost:6482?timeout=10s set future to null (Requests:63)
2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: Requests node redis://localhost:6483?timeout=10s set future to null (Requests:63)
2024-10-18 17:11:34 [ERROR] [lettuce-epollEventLoop-4-10] TEST: nodeList after ping [] (MasterReplicaTopologyRefresh:85)
2024-10-18 17:11:34 [WARN ] [lettuce-epollEventLoop-4-10] Topology refresh returned no nodes from redis-sentinel://127.0.0.1,127.0.0.1:26380?sentinelMasterId=mymaster&timeout=10s (SentinelConnector:126)
2024-10-18 17:11:40 [ERROR] [main] TEST: known nodes [] (MyTest:58)
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 26.33 s <<< FAILURE! -- in io.lettuce.core.masterreplica.MyTest
[ERROR] io.lettuce.core.masterreplica.MyTest.refreshEmptyNodes -- Time elapsed: 24.42 s <<< ERROR!
io.lettuce.core.RedisException: Master is currently unknown: []
at io.lettuce.core.masterreplica.MasterReplicaConnectionProvider.getMaster(MasterReplicaConnectionProvider.java:305)
at io.lettuce.core.masterreplica.MyTest.refreshEmptyNodes(MyTest.java:60)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541) |
Hey @jiajunsu , Thanks for the verbose analysis. Before we jump into the way the Lettuce driver works let's first analyse the scenario and environment you are working with and why sentinel returns an empty list of nodes in the first place. How many sentinel processes are you using to test this scenario? Can you describe your complete environment? |
code in src/main/java/io/lettuce/core/masterreplica/MasterReplicaTopologyRefresh.java public Mono<List<RedisNodeDescription>> getNodes(RedisURI seed) {
CompletableFuture<List<RedisNodeDescription>> future = topologyProvider.getNodesAsync();
Mono<List<RedisNodeDescription>> initialNodes = Mono.fromFuture(future).doOnNext(nodes -> {
applyAuthenticationCredentials(nodes, seed);
});
return initialNodes.map(this::getConnections) // if initialNodes is empty, map `getConnections` will be skipped
.flatMap(asyncConnections -> asyncConnections.asMono(seed.getTimeout(), eventExecutors))
...
} And the empty list was made from
This could be reproduced by running |
Oh, okay, let me address these one at the time:
This is not a recommended setup for Redis Sentinel as you have to have at least 3 sentinel nodes to achieve master promotion and avoid split brain scenarios. See Example 1 and the explanation why this is a broken setup.
My bad, confused topology refresh with master/replica refresh. There are two types of topology refresh (check the SentinelConnector:connectAsync() method):
I think you are simulating an invalid state of the system, which the driver could not (reasonably) handle. The fix in #3019 is also not valid, because we will end up with a list of nodes that we know we cant connect to. At the moment the refresh is happening, there is basically no valid node to connect to, as the masters returned from the At this point I think this is an invalid scenario to consider. |
Sorry for the misleading response. In prod, we do have 3 sentiniel nodes. The 2 running sentinel is in my test env for reproducing this issue.
IMHO, a better way may could be:
I just compared the lettuce features between redis-cluster and redis-sentinel, maybe we could support |
I understand. I do - however - feel this would not happen if you use 3 sentinel nodes.
This test scenario simply simulates what would happen if a
Assuming we add this periodic refresh - which node should we ask for what type of information? |
The original case was happend in our production environment, with 3 sentinel nodes. And I tried to reproduce it by dropping down the replica and 2 sentinel nodes, it didn't happen in about 100 times. That means the trigger condition is too extream to reproduce in real environment. So I have to inject exceptions to lettuce code and test it in unittest env.
Because I failed to reproduce it in real env, and without debug logs, it's hard to get the detail reason why redis sentinel logsThis log is from sentinel-node3, at the same available zone with the master node 22:10:51.299 # +sdown sentinel sentinel-node1 @ mymaster
22:10:51.371 # +sdown slave slave-node @ mymaster
22:10:51.587 # +sdown sentinel sentinel-node2 @ mymaster I agree that leaving
We could make it the same way as the callback function for psub message, |
This is a fair point. We could improve the resiliency of the driver if we schedule some action in this particular scenario. I think we could model it in a similar manner to the This is no trivial task however, I will have to think on it's priority given our backlog and given the fact it is a bit of an exotic issue. Meanwhile contributions are welcome in that direction. |
Bug Report
Current Behavior
In redis sentinel mode, lettuce may refresh topology nodes to empty, if the connection to redis sentinel closed just after lettuce received TopologyRefreshMessage. And that will cause lettuce cannot recover anymore until received next TopologyRefreshMessage.
We have two redis nodes, and redis is running in sentinel mode.
Assume the redis nodes' names are redis1 and redis2, and redis1 is master at the beginning, we inject errors as below:
At step3, redis sentinel send
+sdown sentinel ...
to lettuce client, and trigger lettuce executing methodSentinelConnector::getTopologyRefreshRunnable
. While at the same time, the connection between lettuce client and redis sentinel is closed, and then error occured.lettuce logs below
And since that, the app could not handle any read and write operation because the knownNodes is empty.
By reviewing the git log, we've found this problem existed since commit Do not fail Master/Slave topology refresh if a node is not available. And it still exists at branch master;
When
Requests#getRequest
returnnull
orfuture.isDone()
returns false, it'll clearMasterReplicaConnectionProvider#knownNodes
and the lettuce client could not get recovered untill receiving next TopologyRefreshMessage, or the lettuce client process was restarted.src/main/java/io/lettuce/core/masterreplica/Requests.java
Environment
Possible Solution
We've checked the lettuce code, the code in
SentinelConnector::getTopologyRefreshRunnable
may be improved, to avoid setting empty list to knownNodes.src/main/java/io/lettuce/core/masterreplica/SentinelConnector.java
The text was updated successfully, but these errors were encountered: