CassandraSinkCluster: connection failure handling #1081
Merged
Goal
Currently if we fail to create a connection to a destination node we just terminate the connection to the client.
This PR instead handles failure to connect by:
Implementation
To achieve this, the logic from CassandraSinkCluster::create_control_connection is moved into node_pool::get_accessible_node and reused for every connection type, not just control connections.
Every place that requests a connection now has to handle failure by calling send_error_in_response_to_message / send_error_in_response_to_metadata when a connection cannot be created.

I converted get_round_robin_node_in_dc_rack into get_random_node_in_dc_rack.
A round robin solution would be a little faster than a randomizing solution, as it's cheaper to increment an index than to shuffle a list.
However they should both give equivalent load distributions.
I made this sacrifice as I know that we will be replacing it with a "power of two random choices" implementation eventually, and implementing a round robin that was compatible with get_accessible_node would be possible but difficult to implement and prove correct.

It would be nicer to separate retry concerns out of NodePool.
However I know that when we implement "power of two random choices" routing we will want to be able to skip a random choice that cannot form a connection.
As such I have put the connection testing logic within the routing functions; this will allow our future "power of two random choices" implementation to take advantage of that.
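
To illustrate the idea of testing connections inside the routing function, here is a minimal sketch. It is not the code in this PR: the Node struct, the XorShift generator, the try_connect callback, and the signature of get_accessible_node are all simplified assumptions made so the example is self-contained. It shuffles the candidate nodes in a rack, tries each in turn, and returns the first one that connects, or every collected error if none do.

```rust
/// Hypothetical, simplified node record (not the real shotover type).
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub address: &'static str,
    pub rack: &'static str,
    pub is_up: bool,
}

/// Tiny xorshift PRNG so the sketch needs no external crates.
/// The seed must be non-zero. A real implementation would likely use `rand`.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle of a slice in place.
fn shuffle<T>(items: &mut [T], rng: &mut XorShift) {
    for i in (1..items.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Try the up nodes of `rack` in random order and return the index of the
/// first node that `try_connect` can reach, together with its connection.
/// If every candidate fails, return all errors so the caller can report
/// the failure back to the client.
pub fn get_accessible_node<C, E>(
    nodes: &mut [Node],
    rack: &str,
    rng: &mut XorShift,
    mut try_connect: impl FnMut(&Node) -> Result<C, E>,
) -> Result<(usize, C), Vec<E>> {
    let mut candidates: Vec<usize> = (0..nodes.len())
        .filter(|&i| nodes[i].is_up && nodes[i].rack == rack)
        .collect();
    shuffle(&mut candidates, rng);

    let mut errors = Vec::new();
    for i in candidates {
        match try_connect(&nodes[i]) {
            Ok(connection) => return Ok((i, connection)),
            Err(error) => {
                // Assumption: a node that refuses a connection is marked
                // down so later routing decisions skip it.
                nodes[i].is_up = false;
                errors.push(error);
            }
        }
    }
    Err(errors)
}
```

A caller that receives the Err variant is where something like send_error_in_response_to_message would be invoked. Because the skip-on-failure loop lives in the routing function, a "power of two random choices" variant could presumably reuse the same shape, sampling two candidates per attempt instead of walking a shuffled list.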