cluster manager: initialization cleanups #14382
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters only appear after a cluster has been initialized. This is now uniform both for bootstrap clusters as well as CDS clusters, making the logic simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup existence of the thread local cluster. This change also fixes #14119.
3) Make TLS mocks verify that set() is called before other functions.
Signed-off-by: Matt Klein <mklein@lyft.com>
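To illustrate the kind of check item 3 describes, here is a minimal sketch of a slot-like test double that rejects any call made before set(). It is not Envoy's ThreadLocal mock: the class name `CheckedSlot`, the `update()` method, and the use of an exception to signal misuse are all assumptions made for this example.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, simplified slot used only for illustration. It is not
// Envoy's ThreadLocal::Slot interface or its mock.
class CheckedSlot {
public:
  using Initializer = std::function<std::shared_ptr<int>()>;

  // Stores the value produced by the initializer and records that set() ran.
  void set(Initializer init) {
    value_ = init();
    was_set_ = true;
  }

  // Returns the stored value; fails loudly if set() was never called.
  std::shared_ptr<int> get() const {
    requireSet("get");
    return value_;
  }

  // Applies a mutation to the stored value; also requires a prior set().
  void update(const std::function<void(int&)>& fn) {
    requireSet("update");
    fn(*value_);
  }

private:
  void requireSet(const char* what) const {
    if (!was_set_) {
      throw std::logic_error(std::string(what) + "() called before set()");
    }
  }

  bool was_set_{false};
  std::shared_ptr<int> value_;
};
```

A test that calls `get()` or `update()` on a fresh `CheckedSlot` fails immediately, which surfaces ordering bugs during startup instead of letting them pass silently.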
// For aggregate cluster the per-thread LB is only created once. We need to own it so we
// can pre-populate it before the LB is created and handed to the cluster.
absl::variant<std::unique_ptr<AggregateClusterLoadBalancer>, AggregateClusterLoadBalancer*> lb_;
Hmm, I'm not very happy with the "maybe owned" semantics here. From my debugging yesterday, it seems we could delay the call to refresh here to make sure the LB is created and handed over before the thread local cluster is updated.
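For readers unfamiliar with the "maybe owned" pattern being discussed, the following is a minimal sketch of how a variant can hold an object that is owned until it is handed off and only observed afterwards. It uses `std::variant` as a stand-in for `absl::variant`, and `DemoLoadBalancer`, `MaybeOwnedLb`, and `handOff()` are hypothetical names, not the code in this PR.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for the per-thread aggregate load balancer.
struct DemoLoadBalancer {
  std::vector<int> priorities; // illustrative state filled in before handoff
};

// Owns the load balancer until handOff(), then keeps only a raw pointer.
class MaybeOwnedLb {
public:
  MaybeOwnedLb() : lb_(std::make_unique<DemoLoadBalancer>()) {}

  // Valid in both states, so updates can be applied before and after handoff.
  DemoLoadBalancer* get() {
    if (auto* owned = std::get_if<std::unique_ptr<DemoLoadBalancer>>(&lb_)) {
      return owned->get();
    }
    return std::get<DemoLoadBalancer*>(lb_);
  }

  bool owned() const {
    return std::holds_alternative<std::unique_ptr<DemoLoadBalancer>>(lb_);
  }

  // Transfers ownership to the caller and remembers a non-owning pointer.
  // Calling this twice throws std::bad_variant_access.
  std::unique_ptr<DemoLoadBalancer> handOff() {
    auto owned = std::move(std::get<std::unique_ptr<DemoLoadBalancer>>(lb_));
    lb_.emplace<DemoLoadBalancer*>(owned.get());
    return owned;
  }

private:
  std::variant<std::unique_ptr<DemoLoadBalancer>, DemoLoadBalancer*> lb_;
};
```

A general cost of this pattern is that, after the handoff, the raw pointer is only valid for as long as the new owner keeps the object alive, so the two lifetimes have to be reasoned about together.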
I thought about this a fair amount. The issue is there is no dependency order between clusters so we have to be very careful to not lose updates. I think it may be possible to move all of the logic to the thread local cluster updates (off of the main thread) and initialize first during LB creation on each worker. My opinion though is we should merge this since it will work and people are complaining about this and I can circle back. But up to you. I can try to refactor further.
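The alternative described above (initialize during LB creation on each worker, then rely on thread local cluster updates) can be sketched roughly as follows. This is only an illustration of the idea that no ordering between clusters is required; `WorkerAggregateLb`, its callbacks, and the "first available member" selection rule are assumptions for this example, not Envoy's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-worker aggregate load balancer that tracks which of its
// member clusters currently exist on this worker.
class WorkerAggregateLb {
public:
  // On creation, pick up any member clusters that already exist.
  WorkerAggregateLb(std::vector<std::string> members,
                    const std::set<std::string>& existing)
      : members_(std::move(members)) {
    for (const auto& name : members_) {
      if (existing.count(name) > 0) {
        available_.insert(name);
      }
    }
  }

  // Called for every cluster add/update on this worker; non-members are ignored.
  void onClusterAddOrUpdate(const std::string& name) {
    if (std::find(members_.begin(), members_.end(), name) != members_.end()) {
      available_.insert(name);
    }
  }

  // Called for every cluster removal on this worker.
  void onClusterRemoval(const std::string& name) { available_.erase(name); }

  // Returns the first available member in configured order, if any.
  std::optional<std::string> pick() const {
    for (const auto& name : members_) {
      if (available_.count(name) > 0) {
        return name;
      }
    }
    return std::nullopt;
  }

private:
  std::vector<std::string> members_;
  std::set<std::string> available_;
};
```

Because the constructor reads the clusters that already exist and the callbacks handle everything that arrives later, a member cluster can appear before or after the aggregate LB without an update being lost.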
I looked into this a little more and it's pretty tricky because the thread local load balancer is created in the constructor for the thread local cluster, so the cluster won't exist at that point. I can probably make this better but it's not easy and I recommend going with this for now. Let me know what you think.
Actually, sorry, never mind: we don't need to look up ourselves, just other clusters. Let me see what I can do.
I've spent a bit of time trying to implement the alternate version and it's not so easy. I will see what I can do but I would still recommend we go with this for now and I will try to replace it with something better.
OK, yeah, agreed it is tricky. I'm OK with going with this for now. Do we want to backport this?
Follow up to #14382. Remove TLS use in aggregate cluster. Move all logic into the thread local load balancers making the implementation less brittle and easier to understand. Signed-off-by: Matt Klein <mklein@lyft.com>
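Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters only appear after a cluster has been initialized. This is now uniform both for bootstrap clusters as well as CDS clusters, making the logic simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup existence of the thread local cluster. This change also fixes "Can't initializate Envoy v1.16+ with CDS message" #14119.
3) Make TLS mocks verify that set() is called before other functions.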
Risk Level: Medium. Scary startup stuff.
Testing: Existing and fixed tests.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A