
config: do not finish initialization on stream disconnection #7427

Merged: 20 commits merged into envoyproxy:master on Aug 2, 2019

Conversation

@ramaraochavali (Contributor)

Description: As part of #6151 we ensured that Envoy initialization would not finish until a named response comes. I found that when Envoy sends an EDS request for a cluster and the management server is disconnected/reconnected, Envoy proceeds with initialization even if a named response has not been sent. That is because stream disconnection triggers an onConfigUpdateFailed callback, and we call onPreInitComplete when we get that callback. This PR ensures that we don't call onPreInitComplete on stream disconnects (a rough sketch of the resulting guard follows the description fields below).
Risk Level: Low
Testing: Added a case for this in the existing test case.
Docs Changes: N/A
Release Notes: N/A
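
A rough sketch of the resulting guard, based on the EDS callback shown in the review snippets further down (the enclosing class and the other failure branches are elided; this is an illustration, not the merged code):

void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason,
                                          const EnvoyException*) {
  // A stream disconnect is not a terminal failure: keep waiting for a real
  // (named) response instead of letting cluster initialization proceed.
  if (reason != Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure) {
    onPreInitComplete();
  }
}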

@ramaraochavali (Contributor Author)

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: asan (failed build)

Caused by: a #7427 (comment) was created by @ramaraochavali.

@ramaraochavali (Contributor Author)

@mattklein123 @htuch Please see if this makes sense

mattklein123 self-assigned this on Jul 1, 2019
@mattklein123 (Member)

IMO this is not safe, since a bad management server can cause Envoy to hang. It seems like maybe you want some kind of feature where you can have some number of retries during init before you give up?

@ramaraochavali (Contributor Author)

@mattklein123 But the problem is that an accidental network disconnection can wipe out the cluster members. Also, I thought initial_fetch_timeout is meant to do that safety handling, i.e. when no response comes within the time, it just moves ahead with the initialization.

But looking at the initial fetch timeout timer callback implementation, unfortunately it just passes nullptr, assuming that onConfigUpdateFailed would move forward with init, which would be an issue with this change.

callbacks_.onConfigUpdateFailed(nullptr);

Should we pass a valid exception like TimeoutException here instead of nullptr, so that initial_fetch_timeout can be used as a safety belt here?

@mattklein123 (Member)

@ramaraochavali my general feeling is that we need to have sane defaults. Personally, I would be in favor of making a default initial fetch timeout if none is specified. Perhaps 15s or so? If we have a default in place, as well as good documentation around this, I would be more inclined to do a change like this. @htuch WDYT?

@stale (bot) commented Jul 8, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
@ramaraochavali (Contributor Author)

@mattklein123 I agree we should have sane defaults for initial_fetch_timeout. There is an inconsistency in the initialization behaviour when initial_fetch_timeout is not configured: if the management server is connected but does not send any response, the initialization process waits (for however long it takes), but if there is an accidental network disconnection it just finishes the initialization process. So I think this is a related but slightly different problem.

However, I see your point about having a sane default for initial_fetch_timeout so that we are guaranteed not to hang in any case.

So should I first push another PR to just change initial_fetch_timeout to have a default of 15s?

@mattklein123 (Member)

> So should I first push another PR to just change initial_fetch_timeout to have a default of 15s?

IMO yes. @htuch any thoughts here?

@htuch (Member) commented Jul 9, 2019

What is the underlying temporal property we want here? Is it something like "When an Envoy is initialized, it will have the complete state from the management server for all of its APIs", or more like "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that window, subject to management server availability"?

Do we want to support both models? Either way, any PR should probably elaborate on the xDS docs and explain the key properties we're shooting for here.

@ramaraochavali (Contributor Author)

I think @mattklein123 is leaning towards "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that window, subject to management server availability", which seems reasonable to me. Is that correct @mattklein123?

@mattklein123 (Member)

Yeah that is my thinking.

@htuch (Member) commented Jul 10, 2019

That's fair, I think this needs to be documented in the server initialization docs, since this is a key principle we need to respect going forward.


@ramaraochavali (Contributor Author)

This is waiting on PR #7571.


@mattklein123 (Member) left a comment


Thanks for working on this. At a high level I think this change makes sense, but I would like to work on making the control flow more clear and less hacky. I'm going to also assign @htuch to review as he has worked on this code a lot more. Thank you!

/wait

ENVOY_LOG(warn, "gRPC config: initial fetch timed out for {}", type_url_);
callbacks_.onConfigUpdateFailed(nullptr);
try {
  throw EnvoyException("initial fetch timed out");
Member:

You should be able to allocate an EnvoyException on the stack and pass it into the update failed function.
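
A minimal sketch of that suggestion, assuming the callback still takes a const EnvoyException* as in the hunk above (illustrative, not the final code):

ENVOY_LOG(warn, "gRPC config: initial fetch timed out for {}", type_url_);
// Construct the exception on the stack and pass its address; no try/throw is
// needed just to materialize an exception object.
EnvoyException ex("initial fetch timed out");
callbacks_.onConfigUpdateFailed(&ex);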

  // TODO(htuch): Less fragile signal that this is failure vs. reject.
  if (e == nullptr) {
    stats_.update_failure_.inc();
    ENVOY_LOG(debug, "gRPC update for {} failed", type_url_);
  } else {
    // fetch timeout should be disabled only when the actual timeout happens - not on network
Member:

This seems very fragile to me that we are using exceptions in this way to figure out a timeout vs. not, etc. Can you rework this code to make the control flow a lot more obvious? It's possible that you might need to rework how the update failed functions work.

Contributor Author:

Agree this is fragile. I did not try to change that because it triggers more changes, given how the update failed functions work today, and this pattern existed earlier as well. But it would be good to clean this up. Let me try.

@ramaraochavali (Contributor Author)

@mattklein123 @htuch Added a ConfigUpdateFailureReason enum and changed the flow. PTAL. The envoy-macos pipeline failure does not look related to this PR; can you trigger that build again?
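
For readers following along, a sketch of the enum being referred to. ConnectionFailure and UpdateRejected appear in the review snippets below; FetchTimedout is an assumed name for the remaining case.

namespace Envoy {
namespace Config {

// Why a config update attempt failed, passed to onConfigUpdateFailed so that
// subscribers no longer have to infer the cause from a null exception pointer.
enum class ConfigUpdateFailureReason {
  // The xDS stream disconnected or could not be established.
  ConnectionFailure,
  // The initial fetch timed out before any response arrived (assumed name).
  FetchTimedout,
  // The management server sent a response that Envoy rejected.
  UpdateRejected,
};

} // namespace Config
} // namespace Envoy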

@htuch (Member) left a comment

LGTM modulo a comment. This is a nice improvement to the understandability of config update failure!
/wait

void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason,
                                          const EnvoyException* e) {
  // We should not call onPreInitComplete if this method called because of stream disconnection.
  if (e == nullptr) {
Member:

Why not use the ConfigUpdateFailureReason here rather than the indirect e == nullptr? This seems to be one of the few places (and the motivating example) where there is benefit from plumbing this reason into the subscriber callbacks, but we're not using it.

Contributor Author:

Great catch @htuch. I meant to change this and forgot. Thanks. Changed now, PTAL.

mattklein123 previously approved these changes on Jul 31, 2019
@mattklein123 (Member) left a comment

Thanks, this is a great improvement. A few small comments, and I will defer to @htuch for further review.

  } else {
    ENVOY_LOG(warn, "Filesystem config update failure: {}", e.what());
    stats_.update_failure_.inc();
    callbacks_.onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure,
Member:

This is a bit of a strange error code to use here, but I understand why you did it. Maybe a small TODO/comment?
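
Roughly the kind of annotation being asked for (wording is illustrative):

// TODO: a filesystem subscription has no connection to lose; ConnectionFailure
// is reused here only to mean "failed, but not rejected". Consider a dedicated
// reason if more callers need to tell these cases apart.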

    break;
  case Envoy::Config::ConfigUpdateFailureReason::UpdateRejected:
    // We expect Envoy exception to be thrown when update is rejected.
    ASSERT(e);
Member:

nit: e != nullptr

    stats_.update_rejected_.inc();
    ENVOY_LOG(warn, "gRPC config for {} rejected: {}", type_url_, e->what());
    break;
  default:
Member:

this default case should not be needed.

UNREFERENCED_PARAMETER(e);
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason,
                                          const EnvoyException*) {
  // We should not call onPreInitComplete if this method is called because of stream disconnection.
Member:

Can you add a comment here that this might hang init forever if the user has disabled the init timeout? Might be worth a TODO to warn in this case?
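
Something along these lines is what is being requested (wording is illustrative, not the merged text):

// Note: because stream disconnects no longer complete pre-initialization, an
// EDS cluster can wait on init indefinitely if the user has explicitly
// disabled initial_fetch_timeout and the management server stays unreachable.
// TODO: warn (or enforce a sane default timeout) in that case.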

mattklein123 removed their assignment on Jul 31, 2019
@ramaraochavali (Contributor Author)

@mattklein123 @htuch addressed the feedback. PTAL.

@htuch (Member) left a comment

Thanks!

@ramaraochavali (Contributor Author)

@htuch merged master. PTAL

@ramaraochavali (Contributor Author)

@htuch @mattklein123 Can this be merged now that tests have passed? It might run into master merge conflicts because of the number of files it touches.

mattklein123 merged commit 0957e9c into envoyproxy:master on Aug 2, 2019
ramaraochavali deleted the fix/init_disconnection branch on August 3, 2019