Detect offline shotover nodes for KafkaSinkCluster #1762

Merged

justinweng-instaclustr
Collaborator

@justinweng-instaclustr justinweng-instaclustr commented Oct 3, 2024

After introducing ShotoverNodeState to ShotoverNode in #1758, we should add a task to detect down shotover nodes and set ShotoverNodeState accordingly.

This PR adds a background task, check_shotover_peers, that loops over the peer shotover nodes and tries to open a TCP connection to each one. If the connection cannot be established within connect_timeout_ms, the peer node is marked as down.

  • connect_timeout_ms is the same configuration used when creating a connection to a destination kafka broker.
  • Each check is delayed for (check_shotover_peers_delay_ms + random(-check_shotover_peers_delay_ms/10, check_shotover_peers_delay_ms/10)) before moving to the next peer shotover node.
  • start_shotover_peers_check is called when the instance of KafkaSinkClusterBuilder is being created and hence is called exactly once.
  • check_shotover_peers is not invoked at all if there is no peer shotover node (i.e., there is only 1 shotover node in the cluster).
  • check_shotover_peers is restarted if creating the random number generator fails.
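
The check described above can be sketched roughly as follows. This is a minimal, blocking, std-only illustration, not the code in this PR: the real task runs in the background inside shotover, and the function names `probe_peer` and `jittered_delay_ms`, the `Up`/`Down` variant names, and the `unit` parameter (a random sample in [0, 1] supplied by the caller) are all assumptions made for the example.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical stand-in for the ShotoverNodeState introduced in #1758;
/// the variant names are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShotoverNodeState {
    Up,
    Down,
}

/// Try to open a TCP connection to one peer. If the connection cannot be
/// established within `connect_timeout`, the peer is reported as down.
fn probe_peer(addr: &SocketAddr, connect_timeout: Duration) -> ShotoverNodeState {
    match TcpStream::connect_timeout(addr, connect_timeout) {
        Ok(_stream) => ShotoverNodeState::Up, // stream is dropped (closed) right away
        Err(_) => ShotoverNodeState::Down,
    }
}

/// Delay before moving on to the next peer:
/// base_ms + an offset in [-base_ms/10, base_ms/10].
/// `unit` is a random sample in [0, 1]; it is passed in so that the sketch
/// does not depend on any particular random number generator.
fn jittered_delay_ms(base_ms: u64, unit: f64) -> u64 {
    let unit = unit.clamp(0.0, 1.0);
    let spread = base_ms as f64 / 10.0;
    let offset = (unit * 2.0 - 1.0) * spread;
    (base_ms as f64 + offset).round() as u64
}
```

With a delay setting of 1000 ms, `jittered_delay_ms` yields values between 900 ms and 1100 ms, so peers running the same loop do not all probe at the same instant.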

The next PR will change metadata rewrites to exclude down shotover nodes.

codspeed-hq bot commented Oct 3, 2024

CodSpeed Performance Report

Merging #1762 will degrade performances by 11.83%

Comparing justinweng-instaclustr:handle-offline-shotover-nodes (b1e7742) with main (2b11e0c)

Summary

❌ 1 regressions
✅ 38 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | justinweng-instaclustr:handle-offline-shotover-nodes | Change |
|---|---|---|---|
| encode_system.local_result_v5_no_compression | 93.1 µs | 105.6 µs | -11.83% |

@justinweng-instaclustr justinweng-instaclustr changed the title Handle offline shotover nodes Detect offline shotover nodes Oct 4, 2024
@justinweng-instaclustr justinweng-instaclustr marked this pull request as ready for review October 8, 2024 03:07
@justinweng-instaclustr
Collaborator Author

The regressed benchmark, encode_system.local_result_v5_no_compression, is a Cassandra benchmark and is unrelated to this change, so the regression is noise.

@justinweng-instaclustr justinweng-instaclustr changed the title Detect offline shotover nodes Detect offline shotover nodes for KafkaSinkCluster Oct 8, 2024
Member

@rukai rukai left a comment

Looking good, I've left some minor feedback.

shotover-proxy/tests/kafka_int_tests/mod.rs (resolved)
docs/src/transforms.md (resolved)
Member

@rukai rukai left a comment

LGTM, happy for this to land as is. We can make use of shotover/tokio-bin-process#41 and shotover/tokio-bin-process#42 in a follow-up cleanup if we want.

@rukai rukai enabled auto-merge (squash) October 9, 2024 03:13
@rukai rukai merged commit 9801ed4 into shotover:main Oct 9, 2024
40 of 41 checks passed
3 participants