
ecds: rationalize ownership #19630

Merged: 9 commits into envoyproxy:main on Feb 14, 2022

Conversation

@kyessenov (Contributor) commented Jan 20, 2022

Signed-off-by: Kuat Yessenov <kuat@google.com>

Commit Message: Rationalize ownership of config providers and subscriptions.
Additional Description: This should fix issue #14930.

The root cause is unclear ownership of config between listeners and subscriptions (which are shared and operate separately).
The bug is triggered when multiple churning listeners reference the same ECDS resource, while at the same time ECDS pushes updates to it.

This PR fixes most issues discovered:

  • Ensure that the config subscription only references the server factory context, not the listener context, as the listener context may go away while the subscription is active.
  • Ensure that the asynchronous onConfigUpdate call on a worker does not reference the dynamic config provider, as it is owned by a listener and may go away after onConfigUpdate but before all workers process the event. The solution is to create a shared pointer intermediary that can outlast the dynamic config provider (see the sketch after this list).
  • Ensure that the filter factory is always deleted on main. The regular thread-local destructor does not guarantee that. The solution is to issue an empty update to TLS during the destructor. This requires exposing the isShutdown accessor to avoid an assertion failure during process shutdown.
  • Move the terminal filter detection from the subscription to the provider. This is because the terminal filter condition on a factory depends on the listener context (unlikely to be used, but it is part of the public filter interface).
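
To make the hand-off concrete, here is a minimal sketch of the pattern behind the second and third bullets. It is illustrative only: the names (DynamicFilterConfigProvider, MainConfig, ThreadLocalConfig, tls_) and the runOnAllThreads signature approximate Envoy's thread-local API rather than reproducing the actual code.

```cpp
// Hedged sketch, not the real implementation. The provider's destructor
// publishes an empty update that shares ownership of the config, so the
// config outlives the provider and its last reference dies on main.
DynamicFilterConfigProvider::~DynamicFilterConfigProvider() {
  auto main_config = std::move(main_config_);  // std::shared_ptr<MainConfig>
  if (!tls_->isShutdown()) {                   // accessor exposed by this PR
    tls_->runOnAllThreads(
        // Each worker drops its view; the captured shared_ptr keeps the
        // config alive while workers are still draining the update.
        [main_config](OptRef<ThreadLocalConfig> tls) { tls->config_ = nullptr; },
        // The completion callback runs on the main thread after all workers
        // have run, so the last reference is released there.
        [main_config]() mutable { main_config.reset(); });
  }
}
```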

Risk Level: medium
Testing: long-running black-box integration test with racy xDS, unit tests, and a new integration test
Docs Changes: none
Release Notes: none

Signed-off-by: Kuat Yessenov <kuat@google.com>
@kyessenov (Contributor, Author)

/cc @tbarrella @bianpengyuan

Signed-off-by: Kuat Yessenov <kuat@google.com>
@lambdai (Contributor) commented Jan 20, 2022

Several high-level questions:

  • Ensure that the config subscription only references the server factory context, not the listener context, as the listener context may go away while the subscription is active.

From the perspective of lifetime, this is the right move. It's an interface change to components that may use the context. I guess if the compile succeeds it should be OK? There is also a stats scope change along with the context; perhaps it should be reflected in the release notes.

  • Ensure that the filter factory is always deleted on main. The regular thread-local destructor does not guarantee that. The solution is to issue an empty update to TLS during the destructor. This requires exposing the isShutdown accessor to avoid an assertion failure during process shutdown.

isShutdown seems right in this case. However, there is another corner case that is unrelated to shutdown:
The TLS slot can be destroyed before the lambda executes on each worker thread. Is the filter factory still deleted on the main thread in that case?

@kyessenov (Contributor, Author)

From the perspective of lifetime, this is the right move. It's an interface change to components that may use the context. I guess if the compile succeeds it should be OK? There is also a stats scope change along with the context; perhaps it should be reflected in the release notes.

Would it be fair to add that ECDS stats now bind to the server scope, so they are not cleared when listeners are removed?

isShutdown seems right in this case. However, there is another corner case that is unrelated to shutdown: The TLS slot can be destroyed before the lambda executes on each worker thread. Is the filter factory still deleted on the main thread in that case?

runOnAllThreads ensures that the last call is always on main by posting to main when the last shared-pointer reference is released on a worker. Because we pass the main config as part of the call, it shouldn't matter whether the backing TLS is gone?
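
To illustrate the claim, here is a self-contained toy (not Envoy code): every worker callback copies the shared pointer, and a completion queued back on the main thread holds one more copy, so the final release, and with it the destructor, always happens on main even when the workers finish last.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

struct Config {
  ~Config() {
    std::cout << "Config destroyed on thread " << std::this_thread::get_id()
              << std::endl;  // prints the main thread's id
  }
};

int main() {
  auto config = std::make_shared<Config>();
  std::vector<std::function<void()>> main_queue;

  // Simulated worker callbacks, each holding a copy of the config.
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([copy = config] { /* worker uses *copy */ });
  }

  // The completion callback keeps the last main-thread reference.
  main_queue.push_back([copy = std::move(config)]() mutable { copy.reset(); });

  for (auto& w : workers) {
    w.join();  // workers drop their copies here
  }
  for (auto& cb : main_queue) {
    cb();  // last reference released on main; ~Config runs here
  }
  return 0;
}
```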

@lambdai (Contributor) commented Jan 20, 2022

Would it be fair to add that ECDS stats now bind to the server scope, so they are not cleared when listeners are removed?

I don't have the answer.
Server scope is easier to implement correctly, but per-listener scope is probably what the user wants to see.
I don't understand which is chosen in this PR right now...

@kyessenov (Contributor, Author) commented Jan 20, 2022

Would it be fair to add that ECDS stats now bind to the server scope, so they are not cleared when listeners are removed?

I don't have the answer. Server scope is easier to implement correctly, but per-listener scope is probably what the user wants to see. I don't understand which is chosen in this PR right now...

Scopes are derived from the server factory context in this PR, so they are not tied to the listener one. However, we still carry over the stats_prefix from the listener even when the listeners for the subscription change. I suppose we could change that so the stat prefix for ECDS is completely decoupled from listeners.
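
For illustration, this is roughly what deriving the scope from the server factory context amounts to; the accessor names here are assumptions about the surrounding code, not verbatim:

```cpp
// Assumption: the server factory context exposes the root stats scope via
// scope(), and Stats::Scope::createScope() builds a prefixed child scope.
// Because the parent is the server scope, the child survives listener removal.
auto ecds_scope = server_factory_context.scope().createScope(stat_prefix_);
```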

@kyessenov changed the title from "ecds: rationalize ownership (WIP)" to "ecds: rationalize ownership" on Jan 20, 2022
Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
@lizan (Member) commented Jan 20, 2022

/assign-from @envoyproxy/first-pass-reviewers

@repokitteh-read-only

@envoyproxy/first-pass-reviewers assignee is @adisuissa


Caused by: a #19630 (comment) was created by @lizan.


@lambdai (Contributor) commented Jan 20, 2022

isShutdown seems right in this case. However, there is another corner case that is unrelated to shutdown: The TLS slot can be destroyed before the lambda executes on each worker thread. Is the filter factory still deleted on the main thread in that case?

runOnAllThreads ensures that the last call is always on main by posting to main when the last shared-pointer reference is released on a worker. Because we pass the main config as part of the call, it shouldn't matter whether the backing TLS is gone?

You are right. The on-main-thread callback is not associated with the slot. Sorry for the churn.

Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
@mattklein123 (Member)

@adisuissa can you review this one please?

@adisuissa (Contributor)

Thanks for working on this @kyessenov!
I'm still looking through the changes and trying to get a better understanding of the subtleties of ECDS, and will give my comments tomorrow.

BTW: I think that this PR also fixes #19628.

@kyessenov (Contributor, Author)

Gentle ping for a review. This is destabilizing Wasm APIs in Istio.

@adisuissa (Contributor) left a comment

Thanks for fixing this!
Overall I agree with the change, but I have a question regarding the shared pointer.

@@ -160,8 +151,7 @@ void FilterConfigSubscription::onConfigUpdate(
last_config_hash_ = 0;
last_config_ = nullptr;
last_type_url_ = "";
last_filter_is_terminal_ = false;
last_filter_name_ = "";
last_factory_name_ = "";
@adisuissa (Contributor):

nit: add last_version_info_ = "";

@kyessenov (Contributor, Author):

Done.

ASSERT_TRUE(response->complete());
EXPECT_EQ("200", response->headers().getStatusValue());
}
}
@adisuissa (Contributor):

nit: codec_client_->close();

@kyessenov (Contributor, Author):

Done.


const std::string stat_prefix_;
Server::Configuration::FactoryContext& factory_context_;
std::shared_ptr<MainConfig> main_config_;
@adisuissa (Contributor):

I'm not certain that shared_ptr makes a difference in this case.
IIUC the only update is on the pointed-to object (current_config_), which would be overwritten whether this is a shared pointer or not.

@kyessenov (Contributor, Author):

During destruction, main_config is shared with the runOnAllThreads callbacks, which guarantees that the last reference is released on the main thread. That ensures the inner config is always deleted on the main thread, which is necessary if the inner config has its own TLS.
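
For context, a hedged sketch of why that last point matters; InnerConfig and WorkerState are hypothetical names, not the actual classes:

```cpp
// Assumption-laden sketch: if the inner config owns a thread-local slot,
// its destructor releases that slot, and slot bookkeeping is managed on the
// main thread, so ~InnerConfig must run there. Freeing the slot from a
// worker thread would violate the TLS threading assumptions.
struct InnerConfig {
  ThreadLocal::TypedSlotPtr<WorkerState> slot_;  // released by the destructor
};
```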

@adisuissa (Contributor):

I see, thanks for clarification.

@@ -731,5 +731,59 @@ TEST_P(ExtensionDiscoveryIntegrationTest, BasicFailTerminalFilterNotAtEndOfFilte
EXPECT_EQ("500", response->headers().getStatusValue());
}

// Validate that deleting listeners does not break active ECDS subscription.
TEST_P(ExtensionDiscoveryIntegrationTest, ReloadBoth) {
@adisuissa (Contributor):

I understand that this is not part of this PR, but I think that the current integration test doesn't exercise the onConfigRemoved path because the test doesn't use delta-xDS.

@kyessenov (Contributor, Author):

Ack, let's do this separately since this looks like some existing debt.

Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Kuat Yessenov <kuat@google.com>
@adisuissa (Contributor) left a comment

LGTM.
/assign-from @envoyproxy/senior-maintainers



@repokitteh-read-only

@envoyproxy/senior-maintainers assignee is @snowp


Caused by: a #19630 (review) was submitted by @adisuissa.


@kyessenov (Contributor, Author)

/retest
(timeout)

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit


Caused by: a #19630 (comment) was created by @kyessenov.


@kyessenov (Contributor, Author)

Another gentle ping for a crash fix.

@snowp (Contributor) left a comment

Thanks!

@snowp snowp merged commit cc089aa into envoyproxy:main Feb 14, 2022
dubious90 pushed a commit to envoyproxy/nighthawk that referenced this pull request Feb 15, 2022
- Update `ENVOY_COMMIT` to the latest Envoy commit (envoyproxy/envoy#19630).
- Sync the `.bazelrc` with the current Envoy version.
- Update the pool stream interface according to envoyproxy/envoy#19388 using reasonable defaults. The H3 early send feature may be adopted in the future, but it is not in the scope of this PR.
- Added a hack to prevent integration tests from failing due to a non-empty error log. The logging introduced in envoyproxy/envoy@027d6ca appears to be purely informational and should be fixed upstream (envoyproxy/envoy#19964).

Signed-off-by: tomjzzhang <4367421+tomjzzhang@users.noreply.github.com>