Deploy split proxy/auth with helm chart #18857
Force-pushed from 3c762c9 to 6a00b79
Force-pushed from 958d896 to fb912cc
Upgrading from a previous chart version caused ~3 min of downtime, mainly caused by:
I'm not sure how to minimize this. One could deploy a second release next to the original one, then change the DNS record to point to the new LB. The biggest issue would be reverse-tunnel clients trying to connect to the unreachable proxy, but this might be acceptable.
Looks really good overall. I think the new templating style is going to give people a lot more flexibility.
We might want to update the README a bit to remove the values listing and point people to the comprehensive chart reference too... which unfortunately is also going to need overhauling as part of the changes 😢
Force-pushed from 013f71f to d397909
Looks good.
Force-pushed from f844519 to 620b06f
Part of [RFD-0096](#18274). This PR adds Helm hooks that deploy a configuration test job running `teleport configure --test` to validate that the generated `teleport.yaml` is sane.
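For illustration, a minimal sketch of such a hook Job is below. The resource names, image tag, ConfigMap name, mount path, and the way the config file is passed to `--test` are assumptions, not the chart's actual templates:

```yaml
# Sketch of a pre-install/pre-upgrade Helm hook Job validating the rendered
# teleport.yaml. Names, image tag, mount path, and the --test argument are
# assumptions for illustration only.
apiVersion: batch/v1
kind: Job
metadata:
  name: teleport-cluster-config-check
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: config-check
          image: public.ecr.aws/gravitational/teleport:12   # hypothetical image/tag
          command: ["teleport", "configure", "--test", "/etc/teleport/teleport.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/teleport
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: teleport-cluster-auth   # hypothetical ConfigMap holding teleport.yaml
```

The `hook-delete-policy` annotation keeps successful check Jobs from piling up between upgrades.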
Force-pushed from c83b7d1 to 3ec9f8f
Documentation PR is here: #19881. I'll wait until there are no more comments on this one, then rebase it and send it for review to the docs team.
```yaml
# TLS multiplexing is not supported when using ACM+NLB for TLS termination.
#
# Possible values are 'separate' and 'multiplex'
proxyListenerMode: "separate"
```
Shouldn't the default value be `multiplex` nowadays?
I think more people are terminating TLS in front of Teleport than not when deploying in Kubernetes. The use of ACM is widespread, and the lack of Ingress support is a huge gripe people have with the Helm charts in general. As such, I think it's best if we're still explicit about requiring the use of separate ports/listeners until such time as Teleport can work reliably behind TLS termination. If we default to `multiplex` here, we're just making a rod for our own backs.
If/when we manage to have Teleport run behind an Ingress, we'll be able to multiplex by default on those setups. Based on support requests, a lot of people are using ACM-based setups, and switching to multiplex would break them.
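As a rough sketch of the kind of setup that would break, assuming a hypothetical `annotations.service` passthrough and a placeholder certificate ARN (only `proxyListenerMode` comes from the values excerpt above):

```yaml
# Values sketch for a cluster where an AWS NLB terminates TLS with an ACM
# certificate. The annotations passthrough and ARN are illustrative
# assumptions; proxyListenerMode must stay "separate" behind TLS termination.
proxyListenerMode: "separate"
annotations:
  service:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123456789012:certificate/example"
```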
Related issue: #19975
🎉
Following discussions, I searched GitHub issues and learned Teleport had several issues with stale proxy/auth nodes in the past. I'll run a couple more tests to observe how the cluster reacts during rollouts and validate that the topology change does not introduce a regression or make things worse.
**With split auth/proxy**

The auth rollout took ~5 min for the metrics to go back to normal.

The proxy rollout happened really fast, and all nodes reconnected real quick. However, CPU usage increased dramatically after the rollout and did not go back to regular levels (for proxy, nodes, and a bit for auth). According to the logs, the nodes were discovering 3 or 4 proxies for 30 minutes. Once the nodes stopped discovering stale proxies, the CPU usage went back to nominal.

**With auth and proxy bundled**

The proxy rollout issue also happens.

**Split impact**

We were already facing the issue #20057 in bundled mode. This is non-blocking.

Splitting auth and proxies causes auth rollouts to disconnect every node. They can take up to 5 minutes to reconnect. Most of the wait does look like a bug though: #8793

While the auth rollout impact is not ideal, this does not seem to be a blocker. We might want to investigate why reconnection is delayed, so the update experience becomes smoother.
This commit implements arbitrary configuration passing to Teleport, like what was done for the `teleport-cluster` chart in #18857. This allows users to deploy services or set fields the chart does not support. The huge snapshot diffs are caused by key ordering changes in the config (the YAML export orders keys alphabetically). I validated that the old and new snapshots were strictly equivalent with the following Python snippet:

```python
import yaml
from pathlib import Path

import deepdiff

# Load both snapshot files; each entry wraps a rendered ConfigMap whose
# "teleport.yaml" key holds the actual Teleport configuration.
old = yaml.safe_load(Path("./config-snapshot.old").open())
new = yaml.safe_load(Path("./config-snapshot.new").open())

# Parse the embedded teleport.yaml documents so that key ordering in the
# rendered output does not affect the comparison.
old_content = {
    k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"])
    for (k, v) in old.items()
}
new_content = {
    k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"])
    for (k, v) in new.items()
}

diff = deepdiff.DeepDiff(old_content, new_content)
print(diff)
```
This PR implements big chunks of #18274
The following changes are implemented in the PR:
- `kubernetes` joinMethod to join proxies (a minimal configuration sketch follows after these lists)

The following changes have been implemented and reviewed in sub-PRs:
The following changes will be implemented later, in subsequent PRs:
The following changes will not be implemented and the RFD will be modified to reflect this change:
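As mentioned in the first list above, proxies now join the auth pods with the `kubernetes` join method. A rough sketch of the proxy-side `teleport.yaml` fields involved, with hypothetical service and token names (not the chart's actual rendered output):

```yaml
# Hypothetical proxy-side join configuration using the kubernetes join
# method; the auth service address and token name are placeholders.
teleport:
  auth_server: teleport-cluster-auth.teleport.svc.cluster.local:3025
  join_params:
    method: kubernetes
    token_name: teleport-cluster-proxy
```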
I am not happy with the state of the tests, but the PR is already too large. I would like to revamp the tests in a subsequent PR.
Before merging I want to run the following tests:
Should address:
- `auth` and `proxy` pods #16871
- `teleport-cluster` helm chart #16096 (although it should be tested and we might want to add a containerPort to officialise proxy-peering support)