AeronCluster client never reconnects to the cluster after all nodes have been stopped #1690
Comments
From what I can see, this is expected: the cluster seems to have forgotten the client and just ignores its keep-alive messages. You just need to track it on the client side and do Hope that helps.
The Cluster has not forgotten the client. It has timed out during the restart so it no longer has a valid session. Cluster is designed for high availability. If you take down all nodes in the cluster then it is not available and has to be recovered.
Yes. That's correct. I implemented my own recovery as soon as I discovered this. I can work around that. The documentation or examples do not explicitly show at which stage you should attempt to reconnect to the cluster or even if Additionally. This means that to recover the cluster connection you need to check The bottom line. I am not sure what the expected behavior is here, but maybe it could be useful for
@mjpt777 Never mind.
@mjpt777 Actually I was wrong. This means that for the client to recover without making assumptions about some or all nodes being down, you need to additionally check the ingress publication status after
@the-thing you can check
That's my point: it is not enough. When you shut down all the nodes, it is possible to end up in a state where AeronCluster, after an initial successful connection, reports closed=false while its ingress publication reports closed=true.
I've just retested with Aeron version 1.46.7 and it can be reproduced every time.
Please submit a disabled failing test with a PR.
Aeron version: 1.44.1
Java 17
Currently, when running an AeronCluster client against a single-node cluster, it is not possible to recover the connection after stopping the node and starting it again.
The client keeps running, and the AeronCluster client agent continuously polls for egress messages and sends keep-alive messages, but it seems that it never receives the leadership event io.aeron.cluster.client.AeronCluster#onNewLeader for a single-node cluster, which might be the problem here.
Keep-alive results are taken from debugger logging output at aeron/aeron-cluster/src/main/java/io/aeron/cluster/client/AeronCluster.java, line 447 in 55535ce.
I understand that running a single-node cluster doesn't make much sense, except for local end-to-end tests, but in this case my connection recovery code has to check for io.aeron.cluster.client.AeronCluster#sendKeepAlive returning false (which can also be due to back pressure) and also check the closed state of io.aeron.cluster.client.AeronCluster#ingressPublication in order to recreate the cluster client via io.aeron.cluster.client.AeronCluster#asyncConnect(io.aeron.cluster.client.AeronCluster.Context). Creating a new AeronCluster client seems to be the only way to reconnect to the cluster after a node restart.
This doesn't seem to be a problem for a 3-node cluster. Stopping 2 out of 3 nodes and later starting one again seems to recover automatically.
Also it seems that fully shutting down the 3 node cluster and bringing it back up again doesn't recover the connection.
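For illustration, the recovery check described above could look roughly like the sketch below. To stay self-contained it does not use Aeron directly: ClusterClient is a hypothetical stand-in interface whose methods are assumed to map onto AeronCluster#isClosed, the closed state of AeronCluster#ingressPublication, and AeronCluster#sendKeepAlive. ReconnectingClient, the connector supplier and the failure threshold are likewise illustrative names, not part of Aeron. Because a false keep-alive result can also mean back pressure, the sketch only treats it as a lost connection after several consecutive failures.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the parts of the cluster client that the check needs.
// Assumed mapping to the real client: isClosed -> AeronCluster#isClosed,
// isIngressClosed -> closed state of AeronCluster#ingressPublication,
// sendKeepAlive -> AeronCluster#sendKeepAlive.
interface ClusterClient {
    boolean isClosed();
    boolean isIngressClosed();
    boolean sendKeepAlive();
    void close();
}

// Illustrative wrapper that recreates the client when the connection looks dead.
final class ReconnectingClient {
    private final Supplier<ClusterClient> connector; // assumed to create a fresh, connected client
    private final int maxKeepAliveFailures;
    private ClusterClient client;
    private int keepAliveFailures;
    private int reconnects;

    ReconnectingClient(Supplier<ClusterClient> connector, int maxKeepAliveFailures) {
        this.connector = connector;
        this.maxKeepAliveFailures = maxKeepAliveFailures;
        this.client = connector.get();
    }

    // Call once per keep-alive interval. Returns true if the client was recreated.
    boolean checkConnection() {
        // The client can report closed=false while its ingress publication is closed,
        // so both states are checked.
        boolean dead = client.isClosed() || client.isIngressClosed();

        if (!dead) {
            if (client.sendKeepAlive()) {
                keepAliveFailures = 0;
            } else if (++keepAliveFailures >= maxKeepAliveFailures) {
                // A single false may just be back pressure; repeated failures are
                // treated as a lost session.
                dead = true;
            }
        }

        if (dead) {
            client.close();
            client = connector.get();
            keepAliveFailures = 0;
            reconnects++;
            return true;
        }
        return false;
    }

    int reconnects() {
        return reconnects;
    }
}
```

With the real client, the connector would presumably wrap AeronCluster#asyncConnect(AeronCluster.Context) and wait until the connection attempt completes. That wiring is left out here because it depends on how the client agent is driven.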