cluster manager: initialization cleanups #14382

mattklein123 · 2020-12-11T23:04:01Z

Final follow-up from #13906. This PR does:

1) Simplify the logic during startup by making thread local clusters
only appear after a cluster has been initialized. This is now uniform
for both bootstrap clusters and CDS clusters, making the logic
simpler to follow.
2) Fix the aggregate cluster, which assumed that the thread local
cluster already exists at startup. This change also
fixes #14119 (Can't initializate Envoy v1.16+ with CDS message).
3) Make TLS mocks verify that set() is called before other functions.
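
As a rough illustration of the last item above (mocks that verify set() is called before other functions), here is a minimal sketch. `CheckedSlot`, `ThreadLocalObject`, and `demoCheckedSlot` are hypothetical names for this example and are not Envoy's actual ThreadLocal or mock API; the sketch only shows the ordering check itself.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a thread local object; not Envoy's real type.
struct ThreadLocalObject {
  virtual ~ThreadLocalObject() = default;
};

// Hypothetical slot that rejects any use before set() has been called.
class CheckedSlot {
public:
  using Initializer = std::function<std::shared_ptr<ThreadLocalObject>()>;

  // Must be called before get() or runOnAllThreads().
  void set(const Initializer& init) {
    object_ = init();
    was_set_ = true;
  }

  std::shared_ptr<ThreadLocalObject> get() const {
    checkSet("get");
    return object_;
  }

  void runOnAllThreads(const std::function<void(ThreadLocalObject&)>& cb) {
    checkSet("runOnAllThreads");
    if (object_ != nullptr) {
      cb(*object_);
    }
  }

private:
  void checkSet(const char* caller) const {
    if (!was_set_) {
      throw std::logic_error(std::string(caller) + " called before set()");
    }
  }

  std::shared_ptr<ThreadLocalObject> object_;
  bool was_set_ = false;
};

// Example usage: get() is rejected until set() has run.
void demoCheckedSlot() {
  CheckedSlot slot;
  bool threw = false;
  try {
    slot.get();
  } catch (const std::logic_error&) {
    threw = true;
  }
  assert(threw);
  slot.set([] { return std::make_shared<ThreadLocalObject>(); });
  assert(slot.get() != nullptr);
}
```

A test double built this way fails loudly when code touches a slot that was never populated, which is the kind of startup ordering mistake this PR is concerned with.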

Risk Level: Medium. Scary startup stuff.
Testing: Existing and fixed tests.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A


Signed-off-by: Matt Klein <mklein@lyft.com>

lizan · 2020-12-12T08:18:40Z

source/extensions/clusters/aggregate/cluster.h

+
+    // For aggregate cluster the per-thread LB is only created once. We need to own it so we
+    // can pre-populate it before the LB is created and handed to the cluster.
+    absl::variant<std::unique_ptr<AggregateClusterLoadBalancer>, AggregateClusterLoadBalancer*> lb_;
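
To make the ownership pattern in the member above concrete, here is a minimal sketch of a holder that starts out owning an object and later keeps only a raw pointer once ownership has been handed off. It uses `std::variant` in place of `absl::variant` so that it compiles on its own; `MaybeOwnedLb`, `LoadBalancer`, `release()`, and `demoMaybeOwned` are hypothetical names and this is not the code from the PR.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>

// Hypothetical stand-in for the per-thread load balancer.
struct LoadBalancer {
  int refresh_count = 0;
  void refresh() { ++refresh_count; }
};

class MaybeOwnedLb {
public:
  // Start in the owning state so the LB can be pre-populated.
  MaybeOwnedLb() : lb_(std::make_unique<LoadBalancer>()) {}

  // Works in both states: owned or merely referenced.
  LoadBalancer* get() {
    if (auto* owned = std::get_if<std::unique_ptr<LoadBalancer>>(&lb_)) {
      return owned->get();
    }
    return std::get<LoadBalancer*>(lb_);
  }

  bool owned() const {
    return std::holds_alternative<std::unique_ptr<LoadBalancer>>(lb_);
  }

  // Hand ownership to the caller and keep only a raw pointer.
  // The caller must keep the object alive for as long as this holder is used.
  std::unique_ptr<LoadBalancer> release() {
    auto* owned = std::get_if<std::unique_ptr<LoadBalancer>>(&lb_);
    if (owned == nullptr) {
      return nullptr; // Already handed off.
    }
    std::unique_ptr<LoadBalancer> out = std::move(*owned);
    lb_ = out.get(); // Switch to the non-owning alternative.
    return out;
  }

private:
  std::variant<std::unique_ptr<LoadBalancer>, LoadBalancer*> lb_;
};

// Example usage: pre-populate while owned, then hand off.
void demoMaybeOwned() {
  MaybeOwnedLb holder;
  holder.get()->refresh();
  std::unique_ptr<LoadBalancer> lb = holder.release();
  assert(!holder.owned());
  assert(holder.get() == lb.get());
}
```

The cost of this design is that every access has to handle two states, and the non-owning state depends on the new owner outliving the holder.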


Hmm, not very happy with the "maybe owned" semantics here. From my debugging yesterday, it seems we could delay the call to refresh here to make sure the LB is created and handed over before the thread local cluster is updated.

I thought about this a fair amount. The issue is there is no dependency order between clusters so we have to be very careful to not lose updates. I think it may be possible to move all of the logic to the thread local cluster updates (off of the main thread) and initialize first during LB creation on each worker. My opinion though is we should merge this since it will work and people are complaining about this and I can circle back. But up to you. I can try to refactor further.

I looked into this a little more and it's pretty tricky because the thread local load balancer is created in the constructor for the thread local cluster, so the cluster won't exist at that point. I can probably make this better but it's not easy and I recommend going with this for now. Let me know what you think.

Actually, sorry, never mind, we don't need to look up ourselves, just other clusters. Let me see what I can do.
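
The ordering problem described in the last two comments can be sketched as follows: if a load balancer factory runs inside the constructor of a registry entry, the entry being constructed is not yet visible through lookups, while entries added earlier are. `Registry`, `Entry`, `LbFactory`, and `demoLookupOrder` are hypothetical names, and this is only a simplified model, not the cluster manager's actual code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Registry; // Hypothetical model of a per-thread cluster map.
using LbFactory = std::function<void(const Registry&)>;

// The factory runs inside the entry's constructor.
struct Entry {
  Entry(const Registry& registry, const LbFactory& make_lb) { make_lb(registry); }
};

class Registry {
public:
  bool contains(const std::string& name) const { return entries_.count(name) > 0; }

  void add(const std::string& name, const LbFactory& make_lb) {
    // The factory runs here, before the entry is stored.
    auto entry = std::make_unique<Entry>(*this, make_lb);
    // Only now does the entry become visible to lookups.
    entries_[name] = std::move(entry);
  }

private:
  std::map<std::string, std::unique_ptr<Entry>> entries_;
};

// Example usage: during construction the new entry cannot see itself,
// but it can see entries that were added earlier.
void demoLookupOrder() {
  Registry registry;
  registry.add("primary", [](const Registry&) {});
  bool saw_self = true;
  bool saw_other = false;
  registry.add("aggregate", [&](const Registry& r) {
    saw_self = r.contains("aggregate");
    saw_other = r.contains("primary");
  });
  assert(!saw_self);
  assert(saw_other);
}
```

In this model a factory that only needs to look up other entries works as long as they were added first, which is why the absence of a dependency order between clusters matters.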

I've spent a bit of time trying to implement the alternate version and it's not so easy. I will see what I can do but I would still recommend we go with this for now and I will try to replace it with something better.

OK, yeah, agreed it is tricky. I'm OK with going with this for now. Do we want to backport this?

@lizan I opened #14388 which is a trivial fix and will be easy to backport. The more complicated fix here is required because of the unrelated changes in cluster_manager_impl. I will work on a better version of this PR in parallel and it will not need to be backported.

/wait

Follow up to #14382. Remove TLS use in aggregate cluster. Move all logic into the thread local load balancers making the implementation less brittle and easier to understand. Signed-off-by: Matt Klein <mklein@lyft.com>

mattklein123 requested review from lizan and snowp as code owners December 11, 2020 23:04

mattklein123 assigned lizan and snowp Dec 11, 2020

fix

b5d21f1

Signed-off-by: Matt Klein <mklein@lyft.com>

lizan reviewed Dec 12, 2020


lizan approved these changes Dec 13, 2020


mattklein123 mentioned this pull request Dec 13, 2020

aggregate cluster: fix TLS init issue #14388

Closed

repokitteh-read-only bot added the waiting label Dec 13, 2020

mattklein123 merged commit 0e6047b into master Dec 14, 2020

mattklein123 deleted the local_cluster_cleanups branch December 14, 2020 17:39

mattklein123 mentioned this pull request Dec 15, 2020

aggregate cluster: cleanups #14411

Merged

tbarrella mentioned this pull request Dec 16, 2020

istio-proxy v1.8.0-alpha2 crashing on aggregate cluster istio/istio#28620

Closed


mattklein123 commented Dec 11, 2020

lizan Dec 12, 2020

mattklein123 Dec 12, 2020

mattklein123 Dec 12, 2020

mattklein123 Dec 12, 2020

mattklein123 Dec 12, 2020

lizan Dec 13, 2020

mattklein123 Dec 13, 2020
