-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting. #16236
Conversation
This commit is a revival of envoyproxy#13516 which aimed to solve envoyproxy#12389. That PR was merged and then reverted due to a deadlock bug, which is still present in this commit. The fix will be applied in a subsequent commit in order to make the resolution clearer to PR reviewers. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
The bug was caused by not queuing callbacks in addReadyCb() while holding the lock. This means that you could make the decision to NOT run the callback, get a secret concurrently which would run all callbacks, and then add it to the queue at which point it would never run. This also changed the name of addReadyCb() to addHealthCheckingReadyCb() on HostImpl, and made the logic add a callback for the health checking transport socket. That's the socket that we actually want health checks to wait on which may be different from the data plane socket. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
For reviewers: the first commit is basically just a revert-of-a-revert of #13516 and then rebased onto The second commit has the fix, and then the Run on my local machine demonstrating the bug:
And after it was fixed:
|
// catch a race condition introduced in | ||
// https://github.com/envoyproxy/envoy/pull/13516 where a callback would never | ||
// be called due to a concurrency bug. | ||
TEST_P(SslSocketTest, ServerAddSecretsReadyCallbackParallel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the next test are probably a bit odd since they start a thread to try to tickle a concurrency bug that has been fixed now, and only happened in ~5% of runs on my machine. This means if there is a regression for some reason we might not see it in the PR that introduces it.
Let me know if there is a better way of handling tests like this
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
// 1 microsecond of sleep seems to be a sweet spot for giving the addReadyCb | ||
// thread enough of a head start to reliably cause the test to fail with buggy | ||
// code. | ||
std::this_thread::sleep_for(std::chrono::microseconds(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm failing the style check because of this, I'm looking into whether the issue can be reproduced with ThreadSynchronizer instead (thanks Greg for suggesting that on Slack).
It's a bit weird because with the buggy code I certainly can reproduce it 100% of the time with that, but then when it is fixed there is nowhere I can put the syncPoint() that would let me deadlock it (because the syncpoint would need to be put in a place where the mutex is held).
All of this makes me wonder if this test isn't worth checking in at all: maybe the value was just in demonstrating the bug and that it was fixed, but it's not so useful as a regression test since it's testing a very specific concurrency bug?
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
/assign-from @envoyproxy/first-pass-reviewers |
@envoyproxy/first-pass-reviewers assignee is @antoniovicente |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes to SDS handling and health checks.
I'ld like to better understand threading involved in SDS updates vs health checkers. Please see comments below.
if (ssl_ctx_) { | ||
immediately_run_callback = true; | ||
} else { | ||
secrets_ready_callbacks_.push_back(callback); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What mutex guards secrets_ready_callbacks_ ?
You only have a reader lock on ssl_ctx_mu_ here, I think you need a write lock.
@@ -151,6 +156,7 @@ class ServerSslSocketFactory : public Network::TransportSocketFactory, | |||
const std::vector<std::string> server_names_; | |||
mutable absl::Mutex ssl_ctx_mu_; | |||
Envoy::Ssl::ServerContextSharedPtr ssl_ctx_ ABSL_GUARDED_BY(ssl_ctx_mu_); | |||
std::list<std::function<void()>> secrets_ready_callbacks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add ABSL_GUARDED_BY
std::list<std::function<void()>> secrets_ready_callbacks_ ABSL_GUARDED_BY(ssl_ctx_mu_);
if (ssl_ctx_) { | ||
immediately_run_callback = true; | ||
} else { | ||
secrets_ready_callbacks_.push_back(callback); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no mechanism to cancel pending callbacks on ActiveHealthCheckSession deletion. Is it possible for it to take a long time for secrets to become available, and for the ActiveHealthCheckSession to be deleted before this callback is executed?
@@ -121,6 +121,9 @@ class TsiSocketFactory : public Network::TransportSocketFactory { | |||
Network::TransportSocketPtr | |||
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | |||
|
|||
// TODO(mpuncel) only invoke callback() once secrets are ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC TSI socket does not use SDS to propagate credentials. I think that the current implementation is good enough.
@@ -82,6 +82,7 @@ class ServerStartTlsSocketFactory : public Network::TransportSocketFactory, | |||
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | |||
bool implementsSecureTransport() const override { return false; } | |||
bool usesProxyProtocolOptions() const override { return false; } | |||
void addReadyCb(std::function<void()> callback) override { callback(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO(mpuncel) only invoke callback() once secrets are ready.
may be appropriate in this case.
@@ -50,6 +50,7 @@ class UpstreamProxyProtocolSocketFactory : public Network::TransportSocketFactor | |||
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | |||
bool implementsSecureTransport() const override; | |||
bool usesProxyProtocolOptions() const override { return true; } | |||
void addReadyCb(std::function<void()> callback) override { callback(); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO(mpuncel) only invoke callback() once secrets are ready.
may be appropriate in this case.
} | ||
if (should_run_callbacks) { | ||
for (const auto& cb : secrets_ready_callbacks_) { | ||
cb(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On what thread to these callbacks run?
I don't know what the threading model looks like for health checker vs SDS. I think there's potential for callbacks being invoked from the wrong thread.
} | ||
} | ||
if (should_run_callbacks) { | ||
for (const auto& cb : secrets_ready_callbacks_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what threads are callbacks added to this list? Do adds happen from outside the thread where onAddOrUpdateSecret() is called?
thanks for the feedback, I'll hopefully be able to address it in a couple days |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for
no_traffic_interval
(default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.Risk Level: Medium
Testing: Concurrency test covers the bug that caused this to be reverted initially, and was tested locally with --runs_per_test 100, confirming test failures when the bug is present and none when it was fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977