
CDS: destroy cluster info on master thread #14089

Closed · wants to merge 7 commits

Conversation

@lambdai (Contributor) commented Nov 19, 2020

Not ready to commit; just a strawman.

The goal is to destroy cluster info on the master thread by posting its destruction to the master dispatcher. A simplified sketch of the idea follows the issue list below.

Some known issues:

  1. Cluster info is not guaranteed to be destroyed if the post happens immediately after the master dispatcher stops. I don't think it is a big problem, but we might see that the SSL context is not destroyed because the ClusterInfo holds a strong reference (a shared pointer).
  2. Master-to-master post. Not ideal; we need a thread id in the dispatcher to avoid posting from the master thread back to itself.
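
A minimal standalone sketch of the idea, assuming a tryPost-style API on the dispatcher (the actual diff appears in the review thread below). The Dispatcher interface and the makeClusterInfo helper are simplified stand-ins for illustration, not Envoy's real types:

  #include <functional>
  #include <memory>

  // Simplified stand-in for Envoy's dispatcher; tryPost() is the API this PR proposes.
  struct Dispatcher {
    // Returns false when the dispatcher is no longer accepting posts (e.g. the master exited).
    virtual bool tryPost(std::function<void()> cb) = 0;
    virtual ~Dispatcher() = default;
  };

  struct ClusterInfoImpl {
    // ... SSL contexts, transport socket matcher, filter factories, etc. ...
  };

  // Create the ClusterInfo with a custom deleter so that, whichever worker thread drops the
  // last strong reference, the actual destruction still happens on the master thread.
  std::shared_ptr<ClusterInfoImpl> makeClusterInfo(Dispatcher& master_dispatcher) {
    return std::shared_ptr<ClusterInfoImpl>(
        new ClusterInfoImpl(),
        [&master_dispatcher](ClusterInfoImpl* self) {
          if (!master_dispatcher.tryPost([self]() { delete self; })) {
            // The master dispatcher already stopped: fall back to deleting in place,
            // which is exactly the risk described in issue 1 above.
            delete self;
          }
        });
  }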

Commit Message:
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #13209
[Optional Deprecated:]

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai (Contributor, Author) commented Nov 19, 2020

CC @mattklein123 for early comment

Comment on lines 908 to 921
[&dispatcher](const ClusterInfoImpl* self) {
  FANCY_LOG(debug, "lambdai: schedule destroy cluster info {} on this thread", self->name());
  if (!dispatcher.tryPost([self]() {
        // TODO(lambdai): There is still a risk that the master dispatcher receives the
        // function but does not execute it during shutdown. We can either
        // 1) introduce folly::Function, which supports unique_ptr capture, and destroy the
        //    cluster info by RAII, or
        // 2) run the posted callbacks on the master thread after no worker can post back.
        FANCY_LOG(debug, "lambdai: execute destroy cluster info {} on this thread. Master thread is expected.", self->name());
        delete self;
      })) {
    FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
    delete self;
  }
});
@lambdai (Contributor, Author): This is what posts the cluster info to the master thread.

@lambdai (Contributor, Author): This also prevents a master-to-master post by returning false.
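
A hedged sketch of how that refusal might work, assuming the dispatcher records the id of the thread it runs on (the "thread id in dispatcher" mentioned in the description). The class and naming here are illustrative, not Envoy's actual implementation:

  #include <functional>
  #include <thread>

  // Illustrative dispatcher fragment: tryPost() refuses a post made from the dispatcher's
  // own thread, so the caller can run the cleanup inline instead of re-posting to itself.
  class DispatcherSketch {
  public:
    explicit DispatcherSketch(std::thread::id run_tid) : run_tid_(run_tid) {}

    bool tryPost(std::function<void()> cb) {
      if (std::this_thread::get_id() == run_tid_) {
        return false; // master-to-master post: tell the caller to delete in place
      }
      // ... enqueue cb for the dispatcher's event loop (omitted in this sketch) ...
      (void)cb;
      return true;
    }

  private:
    const std::thread::id run_tid_; // id of the thread the dispatcher runs on
  };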

@mattklein123 self-assigned this on Nov 19, 2020
@mattklein123 (Member) left a comment:

At a high level I'm not crazy about this, but maybe it's the only way to fix this issue. Did you look at what it would take to fix the actual problem of why ClusterInfo stores such complex information that needs to be deleted on the main thread? Can we decouple that somehow? IMO, as I mentioned in the linked issue, there is stuff in ClusterInfo that shouldn't be there.

If we do stick with this approach I left a few comments and this also needs a main merge. Thank you!

/wait

Comment on lines 918 to 919
FANCY_LOG(debug, "lambdai: cannot post. Has the master thread exited? Executing destroy cluster info {} on this thread.", self->name());
delete self;
@mattklein123 (Member):

How can this happen? All workers should shut down and join before the main thread finishes running. Even if things are cleaned up after the join, it should be possible to delete everything on the main thread. Perhaps in this case there needs to be some other cleanup/execution queue for posts that should be run even after the main thread dispatcher has exited?
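
A hypothetical sketch of that suggestion, assuming a dedicated queue for cleanup posts that the main thread drains during shutdown after its dispatcher has exited. None of these names exist in Envoy; this only illustrates the idea:

  #include <functional>
  #include <mutex>
  #include <utility>
  #include <vector>

  // Hypothetical queue for cleanup callbacks that must still run on the main thread even
  // after the main dispatcher has stopped accepting normal posts.
  class PostShutdownCleanupQueue {
  public:
    // Callable from any thread; callbacks are queued rather than executed inline.
    void post(std::function<void()> cb) {
      std::lock_guard<std::mutex> lock(mutex_);
      callbacks_.push_back(std::move(cb));
    }

    // Called once on the main thread, after the workers have been joined, so every queued
    // deletion still runs on the main thread even though the dispatcher loop has exited.
    void drain() {
      std::vector<std::function<void()>> callbacks;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        callbacks.swap(callbacks_);
      }
      for (auto& cb : callbacks) {
        cb();
      }
    }

  private:
    std::mutex mutex_;
    std::vector<std::function<void()>> callbacks_;
  };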

@lambdai (Contributor, Author):

Yes, it is possible. I just want to raise the concern here that we need some extra shutdown steps.

@lambdai (Contributor, Author):

Update: the current shutdown order is

  1. the master stops dispatching
  2. TLS is shut down
  3. the workers stop

The master must refuse ClusterInfo destroy closures before the TLS shutdown (step 2); otherwise the ClusterInfo destructor may trigger other TLS operations and break TLS.

Alternatively, we run some cleanup in the master queue and disable TLS during that cleanup.
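
An illustrative sketch of that alternative, with stubbed shutdown steps mirroring the order above. None of these function names are real Envoy APIs, and the TLS-disable detail is omitted:

  #include <cstdio>
  #include <functional>
  #include <vector>

  // Deferred ClusterInfo deletions posted by workers (e.g. via the custom deleter above).
  static std::vector<std::function<void()>> deferred_deletions;

  void stopMasterDispatching() { std::puts("1. master stops dispatching normal posts"); }
  void shutdownThreadLocalStorage() { std::puts("2. TLS shutdown"); }
  void stopWorkers() { std::puts("3. workers stop and join"); }

  void shutdownServer() {
    stopMasterDispatching();
    // Flush deferred ClusterInfo deletions here, while TLS is still alive, because a
    // ClusterInfo destructor may itself perform TLS operations (the concern raised above).
    for (auto& deletion : deferred_deletions) {
      deletion();
    }
    deferred_deletions.clear();
    shutdownThreadLocalStorage(); // deletions after this point could break TLS
    stopWorkers();
  }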

source/common/upstream/cluster_manager_impl.cc (outdated review thread, resolved)
@lambdai (Contributor, Author) commented Nov 19, 2020

Edit: added const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_; to the list below.

The major concerns are:

  TransportSocketMatcherPtr socket_matcher_;
  const std::unique_ptr<Server::Configuration::CommonFactoryContext> factory_context_;
  std::vector<Network::FilterFactoryCb> filter_factories_;
  const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_;

@lambdai (Contributor, Author) commented Nov 19, 2020

filter_factories_ and factory_context_ (the underlying transport socket) are extended by filter implementations or customized clusters.

I don't have the full context, but it seems the listener guarantees these dependencies are destroyed on the master thread, under the assumption that connections are destroyed on the workers prior to destroying the ListenerImpl.

@mattklein123 (Member):

Can you time box trying to dig into the actual underlying problem of the transport socket sharing and whether we can decouple that somehow?

@lambdai (Contributor, Author) commented Nov 19, 2020

> Can you time box trying to dig into the actual underlying problem of the transport socket sharing and whether we can decouple that somehow?

I did some homework last night and I can add my comments in #13209

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai (Contributor, Author) commented Nov 21, 2020

Fixing regression

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Dec 23, 2020
@github-actions

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Labels: stale (stalebot believes this issue/PR has not been touched recently), waiting

Successfully merging this pull request may close the following issue:

  Data race in callback manager / ClusterInfoImpl destruction