@Mark-Simulacrum
Collaborator

@Mark-Simulacrum Mark-Simulacrum commented Oct 8, 2025

Release Summary:

  • feat(s2n-quic-dc): throttle repeated successful handshakes

Resolved issues:

n/a

Description of changes:

This mirrors prior work for failing handshakes by adding a throttle on outgoing, successful handshakes to a given peer. This is particularly relevant for workloads that happen to trigger replay detection relatively often, which would otherwise cause us to repeatedly handshake with the peer.

It remains generally a good idea for us to switch to new secret material in case of e.g. spurious bitflips in memory or other issues that have broken the old entry in some way, so retaining the handshake-on-possible-replay makes sense, but doing so at a high rate increases CPU utilization for little benefit.
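To make the idea concrete, here is a minimal sketch of a per-peer throttle on successful handshakes. It is not the code from this PR: `HandshakeThrottle`, `should_handshake`, and `record_success` are hypothetical names, and the map keyed by `SocketAddr` is an assumption made for illustration only.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Hypothetical per-peer throttle (illustrative only, not the s2n-quic-dc type).
struct HandshakeThrottle {
    /// Minimum time between successful handshakes to the same peer.
    min_interval: Duration,
    /// When the last successful handshake to each peer completed.
    last_success: HashMap<SocketAddr, Instant>,
}

impl HandshakeThrottle {
    fn new(min_interval: Duration) -> Self {
        Self {
            min_interval,
            last_success: HashMap::new(),
        }
    }

    /// Returns true if a new outgoing handshake to `peer` may start at `now`.
    /// A peer we have never handshaked with is always allowed.
    fn should_handshake(&self, peer: SocketAddr, now: Instant) -> bool {
        match self.last_success.get(&peer) {
            Some(&last) => now.saturating_duration_since(last) >= self.min_interval,
            None => true,
        }
    }

    /// Records that a handshake to `peer` succeeded at `now`.
    fn record_success(&mut self, peer: SocketAddr, now: Instant) {
        self.last_success.insert(peer, now);
    }
}
```

With a one-minute interval, a replay-detection event shortly after a successful handshake would not start another handshake in this sketch, while one arriving a minute or more later still would, so the switch to new secret material is delayed rather than dropped.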

Call-outs:

n/a

Testing:

No particular testing added. This is exercised by most of our existing tests, since the code is hit by anything that handshakes; I don't think we need additional, dedicated test coverage for it.

The stream tests are updated to disable this jitter as it breaks restart tests. As noted in comments, I think the tradeoff is reasonable, but open to other thoughts there (or other defaults instead of 1 minute).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Mark-Simulacrum Mark-Simulacrum marked this pull request as ready for review October 8, 2025 21:11
@boquan-fang boquan-fang self-requested a review October 8, 2025 22:03
boquan-fang
boquan-fang previously approved these changes Oct 9, 2025
- rng.random_range(1..120)
+ rng.random_range(1000..120_000)
};
tokio::time::sleep(Duration::from_secs(duration)).await;
Contributor


what was the reason for this change?

Collaborator Author


From the commit message:

As a drive-by improvement this also tweaks the delays to measure in
milliseconds, increasing the randomization.

Basically it's increasing the entropy in how much we randomize how long we'll sleep for. I doubt it matters much in practice though.
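A rough sketch of what the extra granularity buys, assuming the new value is interpreted as milliseconds. The helper names are hypothetical, and `raw` stands in for the output of a random number generator; the modulo mapping is a simplification that is slightly biased, unlike a proper range sampler.

```rust
use std::time::Duration;

/// Old range: whole seconds in 1..120, i.e. 119 possible delays.
fn delay_from_secs_range(raw: u64) -> Duration {
    Duration::from_secs(1 + raw % 119)
}

/// New range: milliseconds in 1000..120_000, i.e. 119_000 possible delays
/// spanning roughly the same 1 s to 2 min window.
fn delay_from_millis_range(raw: u64) -> Duration {
    Duration::from_millis(1000 + raw % 119_000)
}
```

Both helpers cover about the same window; the millisecond version just has 1000 times as many distinct sleep durations, so two tasks are far less likely to wake at exactly the same moment.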

@Mark-Simulacrum Mark-Simulacrum merged commit d502577 into aws:main Oct 14, 2025
121 checks passed