
CDS: destroy cluster info on master thread #14089

Closed · wants to merge 7 commits

Conversation

@lambdai (Contributor) commented Nov 19, 2020

Not ready to commit; just a strawman.

The goal is to destroy cluster info on the master thread by posting its destruction to the master dispatcher. A simplified sketch of the idea follows the issue list below.

Some known issues:

  1. Cluster info is not guaranteed to be destroyed if the post happens immediately after the master dispatcher stops. I don't think it is a big problem, but we might see that the SSL context is not destroyed because the ClusterInfo holds a strong reference (a shared pointer).
  2. Master-to-master post. Not ideal; we need a thread id in the dispatcher to avoid posting from the master thread back to itself.
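
A minimal standalone sketch of the idea, assuming a tryPost-style API on the dispatcher (the actual diff appears in the review thread below). The Dispatcher interface and the makeClusterInfo helper are simplified stand-ins for illustration, not Envoy's real types:

  #include <functional>
  #include <memory>

  // Simplified stand-in for Envoy's dispatcher; tryPost() is the API this PR proposes.
  struct Dispatcher {
    // Returns false when the dispatcher is no longer accepting posts (e.g. the master exited).
    virtual bool tryPost(std::function<void()> cb) = 0;
    virtual ~Dispatcher() = default;
  };

  struct ClusterInfoImpl {
    // ... SSL contexts, transport socket matcher, filter factories, etc. ...
  };

  // Create the ClusterInfo with a custom deleter so that, whichever worker thread drops the
  // last strong reference, the actual destruction still happens on the master thread.
  std::shared_ptr<ClusterInfoImpl> makeClusterInfo(Dispatcher& master_dispatcher) {
    return std::shared_ptr<ClusterInfoImpl>(
        new ClusterInfoImpl(),
        [&master_dispatcher](ClusterInfoImpl* self) {
          if (!master_dispatcher.tryPost([self]() { delete self; })) {
            // The master dispatcher already stopped: fall back to deleting in place,
            // which is exactly the risk described in issue 1 above.
            delete self;
          }
        });
  }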

Commit Message:
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #13209
[Optional Deprecated:]

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai (Contributor, Author) commented Nov 19, 2020

CC @mattklein123 for early comment

Comment on lines 908 to 921
[&dispatcher](const ClusterInfoImpl* self) {
  FANCY_LOG(debug, "lambdai: schedule destroy cluster info {} on this thread", self->name());
  if (!dispatcher.tryPost([self]() {
        // TODO(lambdai): There is still a risk that the master dispatcher receives the
        // function but does not execute it during shutdown. We can either
        // 1) introduce folly::Function, which supports unique_ptr capture, and destroy the
        //    cluster info by RAII, or
        // 2) run the posted callbacks on the master thread after no worker can post back.
        FANCY_LOG(debug, "lambdai: execute destroy cluster info {} on this thread. Master thread is expected.", self->name());
        delete self;
      })) {
    FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
    delete self;
  }
});
@lambdai (Contributor, Author): This is what posts the cluster info to the master thread.

@lambdai (Contributor, Author): This also prevents a master-to-master post by returning false.
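
A hedged sketch of how that refusal might work, assuming the dispatcher records the id of the thread it runs on (the "thread id in dispatcher" mentioned in the description). The class and naming here are illustrative, not Envoy's actual implementation:

  #include <functional>
  #include <thread>

  // Illustrative dispatcher fragment: tryPost() refuses a post made from the dispatcher's
  // own thread, so the caller can run the cleanup inline instead of re-posting to itself.
  class DispatcherSketch {
  public:
    explicit DispatcherSketch(std::thread::id run_tid) : run_tid_(run_tid) {}

    bool tryPost(std::function<void()> cb) {
      if (std::this_thread::get_id() == run_tid_) {
        return false; // master-to-master post: tell the caller to delete in place
      }
      // ... enqueue cb for the dispatcher's event loop (omitted in this sketch) ...
      (void)cb;
      return true;
    }

  private:
    const std::thread::id run_tid_; // id of the thread the dispatcher runs on
  };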

@mattklein123 self-assigned this on Nov 19, 2020
@mattklein123 (Member) left a comment:

At a high level I'm not crazy about this, but maybe it's the only way to fix this issue. Did you look at what it would take to fix the actual problem of why ClusterInfo stores such complex information that needs to be deleted on the main thread? Can we decouple that somehow? IMO, as I mentioned in the linked issue, there is stuff in ClusterInfo that shouldn't be there.

If we do stick with this approach I left a few comments and this also needs a main merge. Thank you!

/wait

Comment on lines 918 to 919
FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
delete self;
@mattklein123 (Member):

How can this happen? All workers should shut down and join before the main thread finishes running. Even if things are cleaned up after the join, it should be possible to delete everything on the main thread. Perhaps in this case there needs to be some other cleanup/execution queue for posts that should be run even after the main thread dispatcher has exited?
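
A hypothetical sketch of that suggestion, assuming a dedicated queue for cleanup posts that the main thread drains during shutdown after its dispatcher has exited. None of these names exist in Envoy; this only illustrates the idea:

  #include <functional>
  #include <mutex>
  #include <utility>
  #include <vector>

  // Hypothetical queue for cleanup callbacks that must still run on the main thread even
  // after the main dispatcher has stopped accepting normal posts.
  class PostShutdownCleanupQueue {
  public:
    // Callable from any thread; callbacks are queued rather than executed inline.
    void post(std::function<void()> cb) {
      std::lock_guard<std::mutex> lock(mutex_);
      callbacks_.push_back(std::move(cb));
    }

    // Called once on the main thread, after the workers have been joined, so every queued
    // deletion still runs on the main thread even though the dispatcher loop has exited.
    void drain() {
      std::vector<std::function<void()>> callbacks;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        callbacks.swap(callbacks_);
      }
      for (auto& cb : callbacks) {
        cb();
      }
    }

  private:
    std::mutex mutex_;
    std::vector<std::function<void()>> callbacks_;
  };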

@lambdai (Contributor, Author):

Yes, it is possible. I just want to raise the concern here that we need some extra shutdown steps.

@lambdai (Contributor, Author):

Update: the current shutdown order is

  1. the master stops dispatching
  2. TLS is shut down
  3. the workers stop

The master must refuse ClusterInfo destroy closures before the TLS shutdown (step 2); otherwise the ClusterInfo destructor may trigger other TLS operations and break TLS.

Alternatively, we run some cleanup in the master queue and disable TLS during that cleanup.
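
An illustrative sketch of that alternative, with stubbed shutdown steps mirroring the order above. None of these function names are real Envoy APIs, and the TLS-disable detail is omitted:

  #include <cstdio>
  #include <functional>
  #include <vector>

  // Deferred ClusterInfo deletions posted by workers (e.g. via the custom deleter above).
  static std::vector<std::function<void()>> deferred_deletions;

  void stopMasterDispatching() { std::puts("1. master stops dispatching normal posts"); }
  void shutdownThreadLocalStorage() { std::puts("2. TLS shutdown"); }
  void stopWorkers() { std::puts("3. workers stop and join"); }

  void shutdownServer() {
    stopMasterDispatching();
    // Flush deferred ClusterInfo deletions here, while TLS is still alive, because a
    // ClusterInfo destructor may itself perform TLS operations (the concern raised above).
    for (auto& deletion : deferred_deletions) {
      deletion();
    }
    deferred_deletions.clear();
    shutdownThreadLocalStorage(); // deletions after this point could break TLS
    stopWorkers();
  }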

source/common/upstream/cluster_manager_impl.cc (outdated review thread, resolved)
@lambdai (Contributor, Author) commented Nov 19, 2020

Edit: added const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_; to the list below.

The major concerns are:

  TransportSocketMatcherPtr socket_matcher_;
  const std::unique_ptr<Server::Configuration::CommonFactoryContext> factory_context_;
  std::vector<Network::FilterFactoryCb> filter_factories_;
  const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_;

@lambdai (Contributor, Author) commented Nov 19, 2020

filter_factories_ and factory_context_ (the underlying transport socket) are extended by filter implementations or customized clusters.

I don't have the full context, but it seems the listener guarantees these dependencies are destroyed on the master thread, under the assumption that connections are destroyed on the workers prior to destroying the ListenerImpl.

@mattklein123 (Member):

Can you time box trying to dig into the actual underlying problem of the transport socket sharing and whether we can decouple that somehow?

@lambdai (Contributor, Author) commented Nov 19, 2020

> Can you time box trying to dig into the actual underlying problem of the transport socket sharing and whether we can decouple that somehow?

I did some homework last night and I can add my comments in #13209

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai (Contributor, Author) commented Nov 21, 2020

Fixing regression

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Dec 23, 2020
@github-actions

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

Successfully merging this pull request may close the following issue:

  Data race in callback manager / ClusterInfoImpl destruction