CassandraSinkCluster: connection failure handling #1081

rukai · 2023-03-14T23:28:14Z

Goal

Currently if we fail to create a connection to a destination node we just terminate the connection to the client.
This PR instead handles failure to connect by:

attempting to connect to all other possible valid nodes
failing that, returning an error message to the client and keeping the connection to the client open

Implementation

To achieve this, the logic from CassandraSinkCluster::create_control_connection is moved into node_pool::get_accessible_node and reused for every connection type, not just control connections.
Every place that requests a connection now has to handle a failure to create one by calling send_error_in_response_to_message or send_error_in_response_to_metadata.
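
For illustration only, here is a minimal sketch of the fallback behaviour described above. It is not the code from this PR: the `Node` struct, the `is_up` flag, the `NoAccessibleNode` error, the `try_connect` closure and `handle_request` are all hypothetical stand-ins, and the real get_accessible_node is async and works with real connections.

```rust
use std::fmt;

/// Hypothetical stand-in for a node tracked by the pool.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub address: String,
    pub is_up: bool,
}

/// Hypothetical error for the case where no candidate accepts a connection.
#[derive(Debug, PartialEq)]
pub struct NoAccessibleNode {
    pub attempted: usize,
}

impl fmt::Display for NoAccessibleNode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to connect to any of {} candidate nodes", self.attempted)
    }
}

/// Walk the candidates in order and return the index of the first node that
/// accepts a connection. `try_connect` stands in for the real connection attempt.
/// Nodes that fail are marked down (an assumption of this sketch) so that
/// later requests skip them.
pub fn get_accessible_node<F>(
    nodes: &mut [Node],
    mut try_connect: F,
) -> Result<usize, NoAccessibleNode>
where
    F: FnMut(&Node) -> bool,
{
    let mut attempted = 0;
    for (i, node) in nodes.iter_mut().enumerate() {
        if !node.is_up {
            continue; // already known to be down, do not retry it here
        }
        attempted += 1;
        if try_connect(&*node) {
            return Ok(i);
        }
        node.is_up = false;
    }
    Err(NoAccessibleNode { attempted })
}

/// Caller side: a failure becomes an error response for the client
/// instead of a reason to close the client connection.
pub fn handle_request(nodes: &mut [Node], try_connect: impl FnMut(&Node) -> bool) -> String {
    match get_accessible_node(nodes, try_connect) {
        Ok(i) => format!("routed to {}", nodes[i].address),
        Err(e) => format!("error response: {e}"),
    }
}
```

The point of the shape is that every caller gets a `Result` and decides how to turn the error into a response, which is roughly the role send_error_in_response_to_message and send_error_in_response_to_metadata play in the PR.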

I converted get_round_robin_node_in_dc_rack into a get_random_node_in_dc_rack.
A round robin solution would be a little faster than a randomizing solution, as it's cheaper to increment an index than to shuffle a list.
However they should both give equivalent load distributions.
I made this sacrifice because I know we will eventually replace it with a "power of two random choices" implementation, and a round robin that was compatible with get_accessible_node would be possible but difficult to implement and prove correct.
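
To make the cost difference concrete, here is a rough, dependency-free sketch of the two selection strategies. The `RoundRobin` type, the `shuffled_indices` function and the small `XorShift64` generator are hypothetical and written only for this example; the PR itself may use a different random source and data layout. The modulo used to draw an index has a slight bias, which is acceptable for a sketch.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from a zero state
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Round robin: a single counter increment per call, O(1),
/// but it yields only one candidate.
pub struct RoundRobin {
    next: usize,
}

impl RoundRobin {
    pub fn new() -> Self {
        Self { next: 0 }
    }

    pub fn pick(&mut self, len: usize) -> Option<usize> {
        if len == 0 {
            return None;
        }
        let i = self.next % len;
        self.next = self.next.wrapping_add(1);
        Some(i)
    }
}

/// Random order: a Fisher-Yates shuffle of all indices, O(n) per call.
/// The result is a complete fallback order that can be walked
/// until some node accepts a connection.
pub fn shuffled_indices(len: usize, rng: &mut XorShift64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..len).collect();
    for i in (1..len).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}
```

The shuffle does more work per request, but it hands back every node exactly once in a random order, which is a convenient input for a "try each node until one connects" loop.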

It would be nicer to separate retry concerns out of NodePool.
However I know that when we implement "power of two random choices" routing we will want to be able to skip a random choice that cannot form a connection.
As such, I have put the connection testing logic within the routing functions so that our future "power of two random choices" implementation can take advantage of it.
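
As a rough idea of why the connection test belongs next to the routing decision, here is a hypothetical sketch of a "power of two random choices" pick that skips a choice which cannot form a connection. Nothing here exists in Shotover yet: the `Node` struct, its `in_flight` load counter, the `pick_of_two` function and the `can_connect` closure are assumptions made for the example, and drawing the two random indices is left to the caller.

```rust
/// Hypothetical node with a load metric to compare on.
#[derive(Debug)]
pub struct Node {
    pub address: &'static str,
    pub in_flight: usize,
}

/// Given two randomly drawn candidate indices `a` and `b`, prefer the less
/// loaded node. If the preferred node cannot form a connection, fall back to
/// the other one. Returns None if an index is out of range or neither connects.
pub fn pick_of_two<F>(nodes: &[Node], a: usize, b: usize, mut can_connect: F) -> Option<usize>
where
    F: FnMut(&Node) -> bool,
{
    let node_a = nodes.get(a)?;
    let node_b = nodes.get(b)?;
    // Order the two choices by load, lowest first.
    let (first, second) = if node_a.in_flight <= node_b.in_flight {
        (a, b)
    } else {
        (b, a)
    };
    if can_connect(&nodes[first]) {
        return Some(first);
    }
    // Skip the failed choice and try the remaining one.
    if second != first && can_connect(&nodes[second]) {
        return Some(second);
    }
    None
}
```

If the connection test lived in a separate retry layer, the routing code could not tell that its preferred choice was unusable and would have no chance to fall back to the second candidate.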

shotover-proxy/src/transforms/cassandra/sink_cluster/mod.rs

conorbros

I'd usually say be wary of putting too much retry handling into Shotover and instead rely on the drivers to handle retry logic, but I think this change is simple enough and makes sense.

rukai · 2023-03-28T02:32:57Z

You're right that it makes sense to leave it to the client when we can, because it reduces our code complexity.

In this case we need to at least:

issue a node.report_issue() to the dead node otherwise the client might hit the same dead node again
report an error back to the user

And since we got this far we may as well retry another node for the client.
If we can completely hide from the user that a node went down, that sounds like a win to me.
I guess a fair chunk of the complexity comes from this final bit, so if we needed to cut down on complexity it could be removed.

rukai force-pushed the remove_single_rack_v4_error_ignores branch 3 times, most recently from 50a87c8 to f3f1875 Compare March 17, 2023 03:08

rukai force-pushed the remove_single_rack_v4_error_ignores branch 3 times, most recently from cac3f3f to 28bc7dc Compare March 24, 2023 03:10

rukai changed the title ~~Remove error ignores from cassandra_int_tests::cluster_single_rack_v4~~ CassandraSinkCluster: connection failure handling Mar 24, 2023

rukai marked this pull request as ready for review March 24, 2023 10:07

rukai force-pushed the remove_single_rack_v4_error_ignores branch from 28bc7dc to 96803e2 Compare March 24, 2023 10:14

rukai requested review from conorbros and benbromhead March 27, 2023 03:04

rukai force-pushed the remove_single_rack_v4_error_ignores branch from 96803e2 to 6a56b1e Compare March 27, 2023 08:13

conorbros reviewed Mar 28, 2023

conorbros approved these changes Mar 28, 2023

rukai force-pushed the remove_single_rack_v4_error_ignores branch 2 times, most recently from 9d7c4cb to 3be9bc2 Compare March 29, 2023 07:05

CassandraSinkCluster: connection failure handling

943b214

rukai force-pushed the remove_single_rack_v4_error_ignores branch from 3be9bc2 to 943b214 Compare March 29, 2023 22:58

benbromhead approved these changes Apr 3, 2023

conorbros and others added 5 commits April 3, 2023 14:02

Merge branch 'main' into remove_single_rack_v4_error_ignores

234c2bc

Merge branch 'main' into remove_single_rack_v4_error_ignores

b8bb379

Merge branch 'main' into remove_single_rack_v4_error_ignores

80a64cc

Merge branch 'main' into remove_single_rack_v4_error_ignores

de0973e

Merge branch 'main' into remove_single_rack_v4_error_ignores

4cdb4dd

rukai merged commit 7b91370 into shotover:main Apr 3, 2023
