-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
config: do not finish initialization on stream disconnection #7427
config: do not finish initialization on stream disconnection #7427
Conversation
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
/retest |
🔨 rebuilding |
@mattklein123 @htuch Please see if this makes sense |
IMO this is not safe, since a bad management server can cause Envoy to hang. It seems like maybe you want some kind of feature where you can have some number of retries during init before you give up? |
@mattklein123 But the problem is an accidental network disconnection can wipe out the cluster members. Also I thought initial_fetch_timeout is meant to do that safety handling i.e. when no response comes with in the time, it just moves ahead with the initialization. But looking at the initial fetch timeout timer callback implementation, unfortunately it is just passing
Should we pass a valid exception like |
@ramaraochavali my general feeling is that we need to have sane defaults. Personally, I would be in favor of making a default initial fetch timeout if none is specified. Perhaps 15s or so? If we have a default in place, as well as good documentation around this, I would be more inclined to do a change like this. @htuch WDYT? |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
@mattklein123 I agree we should have sane defaults for However I see your point about having some sane defaults for initial_fetch_timout` so that we are guaranteed not to hang in all cases. So should I first push another PR to just change the |
IMO yes. @htuch any thoughts here? |
What is the underlying temporal property we want here? Is it something like "When an Envoy is initialized, it will have the complete state from the management server of all its APIs" or more like "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability"? Do we want to support both models? Either way, any PR should probably elaborate on the xDS docs and explain what the key properties we're shooting for here. |
I think @mattklein123 is leaning towards ""An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability"? which seems reasonable to me. Is that correct @mattklein123 ? |
Yeah that is my thinking. |
That's fair, I think this needs to be documented in the server initialization docs, since this is a key principle we need to respect going forward. |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This is waiting on this PR #7571 |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. At a high level I think this change makes sense, but I would like to work on making the control flow more clear and less hacky. I'm going to also assign @htuch to review as he has worked on this code a lot more. Thank you!
/wait
ENVOY_LOG(warn, "gRPC config: initial fetch timed out for {}", type_url_); | ||
callbacks_.onConfigUpdateFailed(nullptr); | ||
try { | ||
throw EnvoyException("initial fetch timed out"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to allocate an EnvoyException on the stack and pass it into the update failed function.
// TODO(htuch): Less fragile signal that this is failure vs. reject. | ||
if (e == nullptr) { | ||
stats_.update_failure_.inc(); | ||
ENVOY_LOG(debug, "gRPC update for {} failed", type_url_); | ||
} else { | ||
// fetch timeout should be disabled only when the actual timeout happens - not on network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very fragile to me that we are using exceptions in this way to figure out a timeout vs. not, etc. Can you rework this code to make the control flow a lot more obvious? It's possible that you might need to rework how the update failed functions work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree this is fragile. I did not try to change that because it triggers more changes because of how the update failed functions work today and this existed earlier as well. But it would be good to clean this up. Let me try.
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
…ction Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@mattklein123 @htuch Added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo a comment. This is a nice improvement to the understandability of config update failure!
/wait
source/common/upstream/eds.cc
Outdated
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason, | ||
const EnvoyException* e) { | ||
// We should not call onPreInitComplete if this method called because of stream disconnection. | ||
if (e == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the ConfigUpdateFailureReason here rather than the indirect e == nullptr
? This seems to be one of the few places (and motivating example) there is benefit from plumbing this reason in to the subscriber callbacks, but we 're not using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great Catch @htuch. I meant to change this and forgot. Thanks. Changed now, PTAL.
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is a great improvement. A few small comments and will defer to @htuch for further review.
} else { | ||
ENVOY_LOG(warn, "Filesystem config update failure: {}", e.what()); | ||
stats_.update_failure_.inc(); | ||
callbacks_.onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit of a strange error code to use here, but I understand why you did it. Maybe a small TODO/comment?
break; | ||
case Envoy::Config::ConfigUpdateFailureReason::UpdateRejected: | ||
// We expect Envoy exception to be thrown when update is rejected. | ||
ASSERT(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: e != nullptr
stats_.update_rejected_.inc(); | ||
ENVOY_LOG(warn, "gRPC config for {} rejected: {}", type_url_, e->what()); | ||
break; | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this default case should not be needed.
UNREFERENCED_PARAMETER(e); | ||
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason, | ||
const EnvoyException*) { | ||
// We should not call onPreInitComplete if this method is called because of stream disconnection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here that this might hang init forever if the user has disabled the init timeout? Might be worth a TODO to warn in this case?
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@mattklein123 @htuch addressed the feedback. PTAL. |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@htuch merged master. PTAL |
@htuch @mattklein123 can this be merged now tests have passed, it might get in to master merge issue because of the number of files? |
Description: As part of #6151 we ensured that envoy initialization would not finish till a named response comes. I found that when Envoys sends EDS request for a cluster and the management server is disconnected/reconnected, Envoy proceeds with the initialization even if named response is not sent. That is because stream disconnection triggers a
onConfigUpdate
callback and we callonPreInitComplete
when we getonConfigUpdate
callback. This PR ensures that we don;'t callonPreInitComplete
on stream disconnects.Risk Level: Low
Testing: Added a case to test this in the the existing test case.
Docs Changes: N/A
Release Notes: N/A