Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster manager: initialization cleanups #14382

Merged
merged 2 commits into from
Dec 14, 2020
Merged

Conversation

mattklein123
Copy link
Member

Final follow up from #13906. This PR does:

  1. Simplify the logic during startup by making thread local clusters
    only appear after a cluster has been initialized. This is now uniform
    both for bootstrap clusters as well as CDS clusters, making the logic
    simpler to follow.
  2. Aggregate cluster needed fixes due to assumptions on startup
    existence of the thread local cluster. This change also
    fixes Can't initializate Envoy v1.16+ with CDS message #14119
  3. Make TLS mocks verify that set() is called before other functions.

Risk Level: Medium. Scary startup stuff.
Testing: Existing and fixed tests.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup
   existence of the thread local cluster. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>

// For aggregate cluster the per-thread LB is only created once. We need to own it so we
// can pre-populate it before the LB is created and handed to the cluster.
absl::variant<std::unique_ptr<AggregateClusterLoadBalancer>, AggregateClusterLoadBalancer*> lb_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not very happy with this "maybe owned" semantics here. In my debug yesterday seems we may delay the call to refresh here to make sure LB is created and handed before thread local cluster is updated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a fair amount. The issue is there is no dependency order between clusters so we have to be very careful to not lose updates. I think it may be possible to move all of the logic to the thread local cluster updates (off of the main thread) and initialize first during LB creation on each worker. My opinion though is we should merge this since it will work and people are complaining about this and I can circle back. But up to you. I can try to refactor further.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a little more and it's pretty tricky because the thread local load balancer is created in the constructor for the thread local cluster, so the cluster won't exist at that point. I can probably make this better but it's not easy and I recommend going with this for now. Let me know what you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, nevermind, we don't need to look up ourself, just other clusters. Let me see what I can do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent a bit of time trying to implement the alternate version and it's not so easy. I will see what I can do but I would still recommend we go with this for now and I will try to replace it with something better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah agreed it is tricky. I'm ok with going with this for now. Do we want backport this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lizan I opened #14388 which is a trivial fix and will be easy to backport. The more complicated fix here is required because of the unrelated changes in cluster_manager_impl. I will work on a better version of this PR in parallel and it will not need to be backported.

/wait

@mattklein123 mattklein123 merged commit 0e6047b into master Dec 14, 2020
@mattklein123 mattklein123 deleted the local_cluster_cleanups branch December 14, 2020 17:39
mattklein123 added a commit that referenced this pull request Dec 15, 2020
Follow up to #14382. Remove TLS use in aggregate cluster. Move all
logic into the thread local load balancers making the implementation
less brittle and easier to understand.

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this pull request Dec 15, 2020
Follow up to #14382. Remove TLS use in aggregate cluster. Move all
logic into the thread local load balancers making the implementation
less brittle and easier to understand.

Signed-off-by: Matt Klein <mklein@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't initializate Envoy v1.16+ with CDS message
3 participants