AeronCluster client never reconnects to the cluster after all nodes have been stopped #1690

the-thing · 2024-11-16T15:19:37Z

Aeron version: 1.44.1
Java 17

Currently, when running an AeronCluster client against a single-node cluster, it is not possible to recover the connection after stopping the node and starting it again.

The client is constantly running, and the AeronCluster client agent is continuously polling for egress messages and sending keep-alive messages. However, it never seems to receive the leadership event io.aeron.cluster.client.AeronCluster#onNewLeader for a single-node cluster, which might be the problem here.

Keep alive results are taken from debugger logging output at aeron/aeron-cluster/src/main/java/io/aeron/cluster/client/AeronCluster.java, line 447 in 55535ce:

if (position > 0)

Keep alive result: 896
Keep alive result: 960
Keep alive result: 1024
Keep alive result: 1088
Keep alive result: 1152
[single node cluster stopped here]
Keep alive result: -4
Keep alive result: -4
Keep alive result: -4
2024-11-16 16:01:39.699 [test-client-7263566828269871104] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23356506689600,nowNs=23357507749100,aeronClusterClosed=false,ingressPublicationClosed=true,egressSubscriptionClosed=false]
Keep alive result: -4

I understand that running a single-node cluster doesn't make much sense except for local end-to-end tests. In this case, though, my connection recovery code has to check for io.aeron.cluster.client.AeronCluster#sendKeepAlive returning false (which can also be due to back pressure) and also check whether io.aeron.cluster.client.AeronCluster#ingressPublication is closed, and then recreate the cluster client via io.aeron.cluster.client.AeronCluster#asyncConnect(io.aeron.cluster.client.AeronCluster.Context). Creating a new AeronCluster client seems to be the only way to reconnect to the cluster after a node restart.

This doesn't seem to be a problem for a 3 node cluster. Stopping 2 out of 3 nodes and later starting one back again seems to recover automatically.

Keep alive result: 1088
Keep alive result: 1152
[two of the cluster nodes stopped]
Keep alive result: -1
Keep alive result: -1
Keep alive result: -1
2024-11-16 16:05:39.175 [client-32961435-eeaf-4097-bd22-69729c4968bb-7263567786887368704] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23595981178300,nowNs=23596982924000,aeronClusterClosed=false,ingressPublicationClosed=false,egressSubscriptionClosed=false]
Keep alive result: -4
Keep alive result: -4
Keep alive result: -4
2024-11-16 16:05:45.107 [client-32961435-eeaf-4097-bd22-69729c4968bb-7263567786887368704] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=23595981178300,nowNs=23602915971900,aeronClusterClosed=false,ingressPublicationClosed=true,egressSubscriptionClosed=false]
[started 1 node so 2 out of 3 running]
Keep alive result: 64
Keep alive result: 128
Keep alive result: 192
Keep alive result: 256

Also it seems that fully shutting down the 3 node cluster and bringing it back up again doesn't recover the connection.
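For reference when reading the two logs above: assuming the logged value is the raw result of offering to the ingress publication (which is what the if (position > 0) check suggests), the negative numbers line up with the status constants defined on io.aeron.Publication. The sketch below copies those constant values into a hypothetical KeepAliveResult helper so it compiles without Aeron on the classpath; the class and method names are illustrative only.

```java
// Hypothetical helper; the constant values mirror those defined on io.aeron.Publication.
final class KeepAliveResult
{
    static final long NOT_CONNECTED = -1;
    static final long BACK_PRESSURED = -2;
    static final long ADMIN_ACTION = -3;
    static final long CLOSED = -4;
    static final long MAX_POSITION_EXCEEDED = -5;

    // Translate a logged keep-alive result into a readable status name.
    static String describe(final long result)
    {
        if (result > 0)
        {
            return "OK"; // positive values are the new stream position
        }
        if (result == NOT_CONNECTED)
        {
            return "NOT_CONNECTED";
        }
        if (result == BACK_PRESSURED)
        {
            return "BACK_PRESSURED";
        }
        if (result == ADMIN_ACTION)
        {
            return "ADMIN_ACTION";
        }
        if (result == CLOSED)
        {
            return "CLOSED";
        }
        if (result == MAX_POSITION_EXCEEDED)
        {
            return "MAX_POSITION_EXCEEDED";
        }
        return "UNKNOWN(" + result + ")";
    }

    // A publication in one of these states will not accept further offers.
    static boolean isTerminal(final long result)
    {
        return result == CLOSED || result == MAX_POSITION_EXCEEDED;
    }

    public static void main(final String[] args)
    {
        for (final long result : new long[] {1152, -1, -4})
        {
            System.out.println(result + " -> " + describe(result));
        }
    }
}
```

Under that reading, -1 in the 3-node log means the publication was still open but had no connected subscriber, while -4 means the publication had been closed.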


RostyslavBaldovskyi · 2024-11-20T15:48:59Z

From what I can see, your issue is: "When the entire cluster is shut down and brought back up, the client does not recover its connection", and it is not directly related to cluster size.

I think this is expected, because the cluster has forgotten the client and (it looks like) just ignores its keep-alive messages. You need to track this on the client side and call AeronCluster.connect again each time it is detected. A similar issue can happen with AeronArchive, where you also need to track it and reconnect manually.

Hope that helps.

mjpt777 · 2024-11-20T16:17:41Z

The Cluster has not forgotten the client. It has timed out during the restart so it no longer has a valid session. Cluster is designed for high availability. If you take down all nodes in the cluster then it is not available and has to be recovered.

the-thing · 2024-11-20T16:21:09Z

Yes. That's correct. I implemented my own recovery as soon as I discovered this. I can work around that.

The documentation or examples do not explicitly show at which stage you should attempt to reconnect to the cluster or even if AeronCluster client is supposed to automatically recover at all. I guess one should assume that auto recovery is not even supported, but then with some testing I spotted that it can recover in certain scenarios so I am not sure if this is expected.

Additionally, this means that to recover the cluster connection you need to check the return value of io.aeron.cluster.client.AeronCluster#sendKeepAlive. If the value is false, you also have to check io.aeron.cluster.client.AeronCluster#ingressPublication#isClosed to rule out back pressure (an exception is only raised when the publication returns io.aeron.Publication#MAX_POSITION_EXCEEDED). If the ingress publication is also closed, you should create a new connection. However, in certain situations this state can recover automatically.

The bottom line: I am not sure what the expected behavior is here, but it could be useful for io.aeron.cluster.client.AeronCluster#sendKeepAlive to return the publication state instead of a true/false flag, so that checking the ingress publication state is not required.
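The check described above can be sketched as a small decision function. This is only an illustration: RecoveryDecision, Action and afterKeepAlive are hypothetical names, and the three booleans stand in for values a real client would read from AeronCluster#sendKeepAlive(), AeronCluster#isClosed() and AeronCluster#ingressPublication().isClosed().

```java
// Hypothetical sketch of the recovery check; not part of the Aeron API.
final class RecoveryDecision
{
    enum Action
    {
        NONE,      // keep-alive was sent, nothing to do
        RETRY,     // transient failure such as back pressure, try again later
        RECONNECT  // client or ingress publication is closed, create a new client
    }

    /**
     * @param keepAliveSent  assumed to be the result of AeronCluster#sendKeepAlive()
     * @param clusterClosed  assumed to be the result of AeronCluster#isClosed()
     * @param ingressClosed  assumed to be the result of AeronCluster#ingressPublication().isClosed()
     */
    static Action afterKeepAlive(
        final boolean keepAliveSent, final boolean clusterClosed, final boolean ingressClosed)
    {
        if (keepAliveSent)
        {
            return Action.NONE;
        }
        if (clusterClosed || ingressClosed)
        {
            return Action.RECONNECT;
        }
        // Keep-alive failed but nothing is closed: treat as back pressure or a temporary disconnect.
        return Action.RETRY;
    }

    public static void main(final String[] args)
    {
        // State taken from the last warning in the single-node log:
        // aeronClusterClosed=false, ingressPublicationClosed=true.
        System.out.println(afterKeepAlive(false, false, true));
    }
}
```

Since a closed ingress publication was seen to recover in the 3-node case, a caller might prefer to wait for a grace period before acting on RECONNECT; that is a design choice rather than documented behaviour.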

the-thing · 2024-11-20T16:39:25Z

@mjpt777 Never mind. In the newer version the AeronCluster client throws a ClusterException when the ingress publication is closed. Closing this.

the-thing · 2024-11-20T17:08:20Z

@mjpt777 Actually, I was wrong. A ClusterException used to be thrown when the ingress publication was closed, but that was removed in 41b6f30.

This means that for the client to recover without making assumptions about some or all nodes being down, you additionally need to check the ingress publication status after io.aeron.cluster.client.AeronCluster#sendKeepAlive returns false.

mjpt777 · 2024-11-21T10:45:21Z

@the-thing you can check AeronCluster.isClosed() after polling egress. If closed then you need to re-connect.
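A minimal sketch of that suggestion, written against a hypothetical Client interface so it compiles without Aeron on the classpath. The three interface methods are assumed to map to AeronCluster#pollEgress(), AeronCluster#isClosed() and AeronCluster#close(), and the connector is assumed to wrap something like AeronCluster.connect(...) with a fresh context; ReconnectingPoller and its members are illustrative names only.

```java
import java.util.function.Supplier;

// Hypothetical duty-cycle wrapper: poll egress, then replace the client if it reports closed.
final class ReconnectingPoller
{
    // Stand-in for the parts of AeronCluster used here (assumed mapping in the comments).
    interface Client
    {
        int pollEgress();   // AeronCluster#pollEgress()
        boolean isClosed(); // AeronCluster#isClosed()
        void close();       // AeronCluster#close()
    }

    private final Supplier<Client> connector;
    private Client client;
    private int reconnectCount;

    ReconnectingPoller(final Supplier<Client> connector)
    {
        this.connector = connector;
        this.client = connector.get();
    }

    // One iteration of the duty cycle; returns the number of fragments polled.
    int doWork()
    {
        final int fragments = client.pollEgress();
        if (client.isClosed())
        {
            client.close();           // release resources held by the old client
            client = connector.get(); // a real connect can block or throw on timeout
            reconnectCount++;
        }
        return fragments;
    }

    int reconnectCount()
    {
        return reconnectCount;
    }

    public static void main(final String[] args)
    {
        // Fake client that reports closed after its second poll, to exercise the reconnect path.
        final Supplier<Client> connector = () -> new Client()
        {
            private int polls;

            public int pollEgress()
            {
                return ++polls;
            }

            public boolean isClosed()
            {
                return polls >= 2;
            }

            public void close()
            {
            }
        };

        final ReconnectingPoller poller = new ReconnectingPoller(connector);
        for (int i = 0; i < 5; i++)
        {
            poller.doWork();
        }
        System.out.println("reconnects: " + poller.reconnectCount());
    }
}
```

This only reacts to the client's own closed flag, so it relies on the client actually marking itself closed when the cluster goes away.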

the-thing · 2024-11-21T11:09:36Z

@mjpt777

That's my point - it is not enough. When you shut down all the nodes, it is possible to end up in a state where AeronCluster, after an initial successful connection, reports closed=false while the ingress publication reports closed=true.

2024-11-21 12:00:56.684 [client-7265318164389916672-7265318164528865280] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=12448607946000,nowNs=12455493189700,aeronClusterClosed=false,ingressPublicationClosed=false]
[eventually ingress publication flips to closed=true, but aeron cluster not]
2024-11-21 12:00:56.686 [client-7265318164389916672-7265318164528865280] WARN BufferedConnection - Failed to send keep alive message [lastHeartbeatSentTimeNs=12448607946000,nowNs=12455495174100,aeronClusterClosed=false,ingressPublicationClosed=true]

I've just retested with Aeron version 1.46.7 and it can be reproduced every time.

mjpt777 · 2024-11-21T11:11:11Z

Please submit a disabled failing test with a PR.

the-thing changed the title ~~AeronCluster client never reconnects to single node cluster after restart~~ AeronCluster client never reconnects to the cluster after all nodes have been stopped Nov 20, 2024

the-thing closed this as completed Nov 20, 2024

the-thing reopened this Nov 20, 2024

mjpt777 closed this as completed Nov 21, 2024

the-thing mentioned this issue Nov 21, 2024

[Java] AeronCluster client does not close itself after whole cluster shutdown #1691

Open
