AeronCluster client never reconnects to the cluster after all nodes have been stopped #1690

Closed
the-thing opened this issue Nov 16, 2024 · 8 comments

@the-thing
Contributor

the-thing commented Nov 16, 2024

Aeron version: 1.44.1
Java 17

Currently, when running an AeronCluster client against a single-node cluster, the connection cannot be recovered after the node is stopped and started again.

The client runs constantly, and the AeronCluster client agent continuously polls for egress messages and sends keep alive messages. However, for a single-node cluster it never seems to receive the leadership event io.aeron.cluster.client.AeronCluster#onNewLeader, which might be the problem here.

Keep alive results are taken from debugger logging output at

Keep alive result: 896
Keep alive result: 960
Keep alive result: 1024
Keep alive result: 1088
Keep alive result: 1152
[single node cluster stopped here]
Keep alive result: -4
Keep alive result: -4
Keep alive result: -4
2024-11-16 16:01:39.699 [test-client-7263566828269871104] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23356506689600,nowNs=23357507749100,aeronClusterClosed=false,ingressPublicationClosed=true,egressSubscriptionClosed=false]
Keep alive result: -4

I understand that running a single-node cluster doesn't make much sense outside of local end-to-end tests. Even so, my connection recovery code has to check whether io.aeron.cluster.client.AeronCluster#sendKeepAlive returned false (which can also be due to back pressure) and then check whether io.aeron.cluster.client.AeronCluster#ingressPublication is closed, in order to recreate the cluster client via io.aeron.cluster.client.AeronCluster#asyncConnect(io.aeron.cluster.client.AeronCluster.Context). Creating a new AeronCluster client seems to be the only way to reconnect to the cluster after a node restart.

This doesn't seem to be a problem for a 3 node cluster. Stopping 2 out of 3 nodes and later starting one back again seems to recover automatically.

Keep alive result: 1088
Keep alive result: 1152
[two of the cluster nodes stopped]
Keep alive result: -1
Keep alive result: -1
Keep alive result: -1
2024-11-16 16:05:39.175 [client-32961435-eeaf-4097-bd22-69729c4968bb-7263567786887368704] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23595981178300,nowNs=23596982924000,aeronClusterClosed=false,ingressPublicationClosed=false,egressSubscriptionClosed=false]
Keep alive result: -4
Keep alive result: -4
Keep alive result: -4
2024-11-16 16:05:45.107 [client-32961435-eeaf-4097-bd22-69729c4968bb-7263567786887368704] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23595981178300,nowNs=23602915971900,aeronClusterClosed=false,ingressPublicationClosed=true,egressSubscriptionClosed=false]
[started 1 node so 2 out of 3 running]
Keep alive result: 64
Keep alive result: 128
Keep alive result: 192
Keep alive result: 256

Also it seems that fully shutting down the 3 node cluster and bringing it back up again doesn't recover the connection.
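
A note on reading the logs above: the keep alive results appear to be return values of io.aeron.Publication#offer, where a positive value is the new stream position and a negative value is one of the Publication status constants (-1 is NOT_CONNECTED, -4 is CLOSED). The sketch below copies those constants locally so that it compiles without Aeron on the classpath. The class OfferResult and its methods are hypothetical names used only for illustration.

```java
/**
 * Hypothetical helper for interpreting a Publication#offer result.
 * The constants mirror the values defined in io.aeron.Publication.
 */
final class OfferResult
{
    static final long NOT_CONNECTED = -1;
    static final long BACK_PRESSURED = -2;
    static final long ADMIN_ACTION = -3;
    static final long CLOSED = -4;
    static final long MAX_POSITION_EXCEEDED = -5;

    private OfferResult()
    {
    }

    /** Human readable name for a result code, "SENT" for a positive position. */
    static String describe(final long result)
    {
        if (result > 0)
        {
            return "SENT";
        }
        if (result == NOT_CONNECTED)
        {
            return "NOT_CONNECTED";
        }
        if (result == BACK_PRESSURED)
        {
            return "BACK_PRESSURED";
        }
        if (result == ADMIN_ACTION)
        {
            return "ADMIN_ACTION";
        }
        if (result == CLOSED)
        {
            return "CLOSED";
        }
        if (result == MAX_POSITION_EXCEEDED)
        {
            return "MAX_POSITION_EXCEEDED";
        }
        return "UNKNOWN";
    }

    /** True when retrying on the same publication cannot succeed. */
    static boolean requiresNewConnection(final long result)
    {
        return result == CLOSED || result == MAX_POSITION_EXCEEDED;
    }
}
```

Under this reading, the -1 results in the 3 node log are transient (no connected subscriber yet), while the -4 results mean the ingress publication itself has been closed.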

@RostyslavBaldovskyi

From what I can see, your issue is this: "When the entire cluster is shut down and brought up again, the client does not recover the connection", and it is not directly related to cluster size.

I think this is expected, because it looks like the cluster has forgotten the client and simply ignores its keep alive messages. You just need to track this on the client side and call AeronCluster.connect again each time it is detected. A similar issue can happen with AeronArchive, where you also need to track it and reconnect manually.

Hope that helps.

@mjpt777
Contributor

mjpt777 commented Nov 20, 2024

The Cluster has not forgotten the client. It has timed out during the restart so it no longer has a valid session. Cluster is designed for high availability. If you take down all nodes in the cluster then it is not available and has to be recovered.

@the-thing
Contributor Author

the-thing commented Nov 20, 2024

Yes. That's correct. I implemented my own recovery as soon as I discovered this. I can work around that.

The documentation or examples do not explicitly show at which stage you should attempt to reconnect to the cluster or even if AeronCluster client is supposed to automatically recover at all. I guess one should assume that auto recovery is not even supported, but then with some testing I spotted that it can recover in certain scenarios so I am not sure if this is expected.

Additionally, this means that to recover the cluster connection you need to check the return value of io.aeron.cluster.client.AeronCluster#sendKeepAlive. If it is false, you also have to check io.aeron.cluster.client.AeronCluster#ingressPublication#isClosed, because false can simply mean back pressure (an exception is only raised when the publication returns io.aeron.Publication#MAX_POSITION_EXCEEDED). If the ingress publication is also closed, you should create a new connection. However, in certain situations this state recovers automatically.
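
The check described above can be sketched as a small decision function. This is a minimal illustration, not Aeron code: the Client interface is a hypothetical stand-in for the two calls named in this comment (sendKeepAlive, and isClosed on the ingress publication), and the names KeepAliveCheck, Action, afterKeepAlive and fixed are assumptions made for the example.

```java
/** Hypothetical sketch of the keep alive recovery decision described in this thread. */
final class KeepAliveCheck
{
    /** What the caller should do after one keep alive attempt. */
    enum Action
    {
        NONE, RETRY, RECONNECT
    }

    /** Stand-in for the parts of the cluster client used here (not the real Aeron type). */
    interface Client
    {
        boolean sendKeepAlive();

        boolean isIngressClosed();
    }

    private KeepAliveCheck()
    {
    }

    static Action afterKeepAlive(final Client client)
    {
        if (client.sendKeepAlive())
        {
            return Action.NONE; // keep alive was sent, nothing to do
        }

        // false alone is ambiguous: it may only be back pressure,
        // so the ingress publication state decides between retry and reconnect.
        return client.isIngressClosed() ? Action.RECONNECT : Action.RETRY;
    }

    /** Demo-only stub returning fixed answers. */
    static Client fixed(final boolean keepAliveSent, final boolean ingressClosed)
    {
        return new Client()
        {
            public boolean sendKeepAlive()
            {
                return keepAliveSent;
            }

            public boolean isIngressClosed()
            {
                return ingressClosed;
            }
        };
    }
}
```

With a real client, the Client methods would presumably delegate to AeronCluster#sendKeepAlive and AeronCluster#ingressPublication().isClosed(), and RECONNECT would map to creating a new client.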

Bottom line: I am not sure what the expected behavior is here, but it might be useful for io.aeron.cluster.client.AeronCluster#sendKeepAlive to return the publication state instead of a true/false flag, so that checking the ingress publication state separately is not required.

@the-thing changed the title from "AeronCluster client never reconnects to single node cluster after restart" to "AeronCluster client never reconnects to the cluster after all nodes have been stopped" on Nov 20, 2024
@the-thing
Contributor Author

@mjpt777 Never mind. In the newer version, the AeronCluster client throws a ClusterException when the ingress publication is closed. Closing this.

@the-thing
Contributor Author

@mjpt777 Actually, I was wrong. A ClusterException used to be thrown when the ingress publication was closed, but that was removed in 41b6f30.

This means that, for the client to recover without making assumptions about some or all nodes being down, you additionally need to check the ingress publication status after io.aeron.cluster.client.AeronCluster#sendKeepAlive returns false.

@the-thing reopened this on Nov 20, 2024
@mjpt777
Contributor

mjpt777 commented Nov 21, 2024

@the-thing you can check AeronCluster.isClosed() after polling egress. If closed then you need to re-connect.

@mjpt777 closed this as completed on Nov 21, 2024
@the-thing
Contributor Author

@mjpt777

That's my point - it is not enough. After an initial successful connection, it is possible to end up in a state where AeronCluster reports closed=false but the ingress publication reports closed=true once you shut down all the nodes.

2024-11-21 12:00:56.684 [client-7265318164389916672-7265318164528865280] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=12448607946000,nowNs=12455493189700,aeronClusterClosed=false,ingressPublicationClosed=false]
[eventually ingress publication flips to closed=true, but aeron cluster not]
2024-11-21 12:00:56.686 [client-7265318164389916672-7265318164528865280] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=12448607946000,nowNs=12455495174100,aeronClusterClosed=false,ingressPublicationClosed=true]

I've just retested with Aeron version 1.46.7 and it can be reproduced every time.
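
To make the workaround concrete, here is a minimal sketch of a duty cycle that treats either condition as a signal to reconnect: the client being closed (the check suggested above) or the ingress publication being closed (the state shown in the log). Everything in it is hypothetical: Session stands in for the AeronCluster client, the Supplier stands in for a call such as AeronCluster.connect, and error handling and backoff for a cluster that is still down are deliberately left out.

```java
/** Hypothetical wrapper that recreates the cluster session when it is no longer usable. */
final class ReconnectingClient
{
    /** Stand-in for the parts of the cluster client used here (not the real Aeron type). */
    interface Session
    {
        int pollEgress();

        boolean isClosed();

        boolean isIngressClosed();

        void close();
    }

    private final java.util.function.Supplier<Session> connector;
    private Session session;
    private int reconnectCount;

    ReconnectingClient(final java.util.function.Supplier<Session> connector)
    {
        this.connector = connector;
        this.session = connector.get();
    }

    /** One duty cycle: poll egress, then replace the session if either close flag is set. */
    int doWork()
    {
        final int fragments = session.pollEgress();

        // Checking isClosed() alone misses the case reported in this thread,
        // where the client stays open while its ingress publication is closed.
        if (session.isClosed() || session.isIngressClosed())
        {
            session.close();
            session = connector.get();
            reconnectCount++;
        }

        return fragments;
    }

    int reconnectCount()
    {
        return reconnectCount;
    }
}
```

In a real client the connector would probably need to tolerate connect timeouts while the cluster is unavailable, for example by retrying on a later duty cycle.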

@mjpt777
Contributor

mjpt777 commented Nov 21, 2024

Please submit a disabled failing test with a PR.
