Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CassandraSinkCluster: connection failure handling #1081

Merged
merged 6 commits into from
Apr 3, 2023

Conversation

rukai
Copy link
Member

@rukai rukai commented Mar 14, 2023

Goal

Currently if we fail to create a connection to a destination node we just terminate the connection to the client.
This PR instead handles failure to connect by:

  1. attempting to connect to all other possible valid nodes
  2. failing that, returns an error message to the client and keeps the connection to the client open

Implementation

To achieve this the logic from CassandraSinkCluster::create_control_connection is moved into node_pool::get_accessible_node and reused for every connection type not just control connections.
Everywhere a connection is requested then has to handle failure by calling send_error_in_response_to_message/send_error_in_response_to_metadata when a failure to create a connection occurs.

I converted get_round_robin_node_in_dc_rack into a get_random_node_in_dc_rack.
A round robin solution would be a little faster then a randomizing solution as its cheaper to increment an index than to shuffle a list.
However they should both give equivalent load distributions.
I made this sacrifice as I know that we will be replacing it with a "power of two random choices" implementation eventually and implementing a round robin that was compatible with get_accessible_node would be possible but difficult to implement and prove correct.

It would be nicer to separate retry concerns out of NodePool.
However I know that when we implement "power of two random choices" routing we will want to be able to skip a random choice that cannot form a connection.
As such I have put the connection testing logic within the routing functions, this will allow our future "power of two random choices" implementation to take advantage of that.

@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch 3 times, most recently from 50a87c8 to f3f1875 Compare March 17, 2023 03:08
@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch 3 times, most recently from cac3f3f to 28bc7dc Compare March 24, 2023 03:10
@rukai rukai changed the title Remove error ignores from cassandra_int_tests::cluster_single_rack_v4 CassandraSinkCluster: connection failure handling Mar 24, 2023
@rukai rukai marked this pull request as ready for review March 24, 2023 10:07
@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch from 28bc7dc to 96803e2 Compare March 24, 2023 10:14
@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch from 96803e2 to 6a56b1e Compare March 27, 2023 08:13
Copy link
Member

@conorbros conorbros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd usually say be wary of putting too much retry handling into Shotover and instead rely on the drivers to handle retry logic I think this change is simple enough and makes sense.

@rukai
Copy link
Member Author

rukai commented Mar 28, 2023

Your right in that it makes sense to leave it to the client when we can because it reduces our code complexity.

In this case we need to at least:

  • issue a node.report_issue() to the dead node otherwise the client might hit the same dead node again
  • report an error back to the user

And since we got this far we may as well retry another node for the client.
If we can completely hide that a node went down to the user that sounds like a win to me.
I guess a fair chunk of the complexity comes from this final bit, so if we needed to cut down on complexity it could be removed.

@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch 2 times, most recently from 9d7c4cb to 3be9bc2 Compare March 29, 2023 07:05
@rukai rukai force-pushed the remove_single_rack_v4_error_ignores branch from 3be9bc2 to 943b214 Compare March 29, 2023 22:58
@rukai rukai merged commit 7b91370 into shotover:main Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants