Deploy split proxy/auth with helm chart #18857

Merged · 15 commits · Jan 11, 2023

Conversation

@hugoShaka (Contributor) commented Nov 29, 2022

This PR implements big chunks of #18274

The following changes are implemented in the PR:

  • split responsibilities between two pod sets: auth pods and proxy pods
  • use the kubernetes joinMethod to join proxies
  • have a configuration template per mode: aws, gcp, standalone, scratch
  • scale proxies by default when they are replicable (cert-manager or a user-provided cert through a secret; in the future we might also consider pods running behind an ingress replicable)
  • merge user-provided configuration with the generated configuration (see the values sketch after this list)
  • fix initContainer logic (the field was breaking when more than one initContainer was provided)
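
As a rough illustration of the split and the configuration merge, here is a minimal, hypothetical values sketch. The value names (`chartMode`, `highAvailability.certManager`, `auth.teleportConfig`, `proxy.teleportConfig`) are assumptions based on the description above, not the chart's authoritative schema — check the chart reference for the real one.

```yaml
# Hypothetical values.yaml sketch -- value names are assumptions.
clusterName: teleport.example.com
chartMode: standalone              # one of: aws, gcp, standalone, scratch

# Proxies scale by default when their certificate is replicable,
# e.g. issued by cert-manager or provided through a pre-existing secret.
highAvailability:
  replicaCount: 2
  certManager:
    enabled: true
    issuerName: letsencrypt-production

# User-provided snippets deep-merged into the generated teleport.yaml,
# one override per pod set.
auth:
  teleportConfig:
    teleport:
      log:
        severity: DEBUG
proxy:
  teleportConfig:
    teleport:
      log:
        severity: DEBUG
```

With this shape, the chart can render one configuration per pod set and merge the user snippet on top of the generated defaults.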

The following changes have been implemented and reviewed in sub-PRs:

The following changes will be implemented later, in subsequent PRs:

The following changes will not be implemented and the RFD will be modified to reflect this change:

  • using the operator to create proxy join tokens
  • adding a field to deploy CRs alongside the deployment

I am not happy with the state of the tests, but the PR is already too large. I would like to revamp the tests in a subsequent PR.

Before merging I want to run the following tests:

  • Follow local lab getting started with old chart and upgrade to split chart
  • Follow local lab getting started with split chart
  • Follow GKE getting started with old chart and upgrade to split chart
  • Follow GKE getting started with split chart
  • Follow EKS getting started with old chart and upgrade to split chart
  • Follow EKS getting started with split chart (broken, blocked by Fix Kubernetes version detection on EKS #19188)
  • Load-test the chart before and after this PR to estimate the performance impact

Should address:

@hugoShaka added the helm label Nov 29, 2022
@hugoShaka force-pushed the hugo/chart-split-proxy-auth branch 7 times, most recently from 3c762c9 to 6a00b79 on December 7, 2022 14:05
@hugoShaka force-pushed the hugo/chart-split-proxy-auth branch 2 times, most recently from 958d896 to fb912cc on December 7, 2022 15:19
@hugoShaka marked this pull request as ready for review December 7, 2022 15:55
@hugoShaka (Contributor, Author) commented Dec 8, 2022

Upgrading from a past chart version caused ~3 min of downtime, mainly spent on:

  • auth startup time: 45s
  • proxy connect retry: 30s
  • proxy readiness: 30s
  • load balancing pool update: 30s

I'm not sure how to minimize this. One could deploy a second release next to the original one, then change the DNS record to point to the new load balancer. The biggest issue would be reverse-tunnel clients trying to connect to an unreachable proxy, but this might be acceptable.

@webvictim (Contributor) left a comment

Looks really good overall. I think the new templating style is going to give people a lot more flexibility.

We might want to update the README a bit to remove the values listing and point people to the comprehensive chart reference too... which unfortunately is also going to need overhauling as part of the changes 😢

Resolved review threads on:
  • examples/chart/teleport-cluster/templates/auth/pvc.yaml
  • examples/chart/teleport-cluster/values.yaml (three threads)
  • examples/chart/teleport-cluster/templates/psp.yaml
@tigrato (Contributor) left a comment

Looks good.

Part of [RFD-0096](#18274)

This PR adds helm hooks that deploy a configuration test job running `teleport configure --test` to validate that the generated `teleport.yaml` configuration is sane.
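
For readers unfamiliar with Helm test hooks, a sketch of what such a validation job could look like. The resource names, image tag, mount paths, and the config-file argument are hypothetical; only the standard `helm.sh/hook` annotation and the `teleport configure --test` command come from the description above.

```yaml
# Hypothetical test-hook Job; names, image, and paths are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: teleport-cluster-config-check   # hypothetical name
  annotations:
    "helm.sh/hook": test                # run via `helm test`
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: config-check
          image: public.ecr.aws/gravitational/teleport:11  # illustrative tag
          # Validate the rendered configuration without starting Teleport.
          command: ["teleport", "configure", "--test", "/etc/teleport/teleport.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/teleport
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: teleport-cluster-auth  # hypothetical ConfigMap name
```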
@hugoShaka (Contributor, Author) commented:
Documentation PR is here: #19881

I'll wait until there are no more comments on this one, then rebase and send it to the docs team for review.

```yaml
# TLS multiplexing is not supported when using ACM + NLB for TLS termination.
#
# Possible values are 'separate' and 'multiplex'.
proxyListenerMode: "separate"
```
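
For context on the thread below: this chart value maps to Teleport's TLS routing setting. A rough sketch of the corresponding field in the generated teleport.yaml, assuming Teleport config v2; treat it as illustrative rather than the chart's exact output.

```yaml
# Illustrative mapping to the generated teleport.yaml (config v2).
version: v2
auth_service:
  # "multiplex" serves all protocols over the single web port (TLS routing);
  # "separate" keeps a dedicated listener per protocol, which is what
  # ACM+NLB-terminated setups rely on.
  proxy_listener_mode: separate
```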
A contributor commented:
Shouldn't the default value be multiplex nowadays?

@webvictim (Contributor) replied on Jan 9, 2023:

I think more people are terminating TLS in front of Teleport than not when deploying in Kubernetes. The use of ACM is widespread, and the lack of Ingress support is a huge gripe people have with the Helm charts in general. As such, I think it's best to stay explicit about requiring separate ports/listeners until Teleport can work reliably behind TLS termination. If we default to multiplex here, we're just making a rod for our own backs.

@hugoShaka (Contributor, Author) replied:

If/when we manage to have Teleport run behind an Ingress, we'll be able to multiplex by default on those setups. Based on support requests, a lot of people are using ACM-based setups, and switching the default to multiplex would break them.

@hugoShaka (Contributor, Author) replied:

Related issue: #19975

@webvictim (Contributor) left a comment

🎉

@hugoShaka (Contributor, Author) commented:
Following discussions, I searched GitHub issues and learned that Teleport had several issues with stale proxy/auth nodes in the past. I'll run a couple more tests to observe how the cluster reacts during rollouts and validate that the topology change does not introduce a regression or make things worse.

@hugoShaka (Contributor, Author) commented Jan 10, 2023

With split auth/proxy

The auth rollout took ~5 min for the metrics to go back to normal.

The proxy rollout happened really fast, and all nodes reconnected quickly. However, CPU usage increased dramatically after the rollout and did not go back to regular levels (for proxies, nodes, and to a lesser extent auth).

According to the logs, the nodes kept discovering 3 or 4 proxies for 30 minutes. Once the nodes stopped discovering stale proxies, CPU usage went back to nominal levels.

With auth and proxy bundled

The proxy rollout issue also happens in bundled mode.

Split impact

We were already facing issue #20057 in bundled mode. This is non-blocking.

Splitting auth and proxies causes auth rollouts to disconnect every node, and nodes can take up to 5 minutes to reconnect. Most of the wait looks like a bug though: #8793

While the auth rollout impact is not ideal, it does not seem to be a blocker. We might want to investigate why reconnection is delayed so that the update experience becomes smoother.

@zmb3 (Collaborator) left a comment


@hugoShaka enabled auto-merge (squash) January 11, 2023 17:46
@hugoShaka merged commit 4ca4b54 into master Jan 11, 2023
hugoShaka added a commit that referenced this pull request Jan 12, 2023
This commit implements arbitrary configuration passing to Teleport, like what was done for the `teleport-cluster` chart in #18857. This allows users to deploy services or set fields the chart does not support.

The huge snapshot diffs are caused by key-order changes in the config (the YAML export orders keys alphabetically). I validated that the old and new snapshots were strictly equivalent with the following Python snippet:

```python
import yaml
from pathlib import Path
import deepdiff

old = yaml.safe_load(Path("./config-snapshot.old").open())
new = yaml.safe_load(Path("./config-snapshot.new").open())

# Each snapshot entry wraps a rendered manifest: unwrap the stored YAML
# string, then parse the embedded teleport.yaml from the ConfigMap data.
old_content = {k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"]) for (k, v) in old.items()}
new_content = {k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"]) for (k, v) in new.items()}

# An empty diff confirms the old and new snapshots are semantically identical.
diff = deepdiff.DeepDiff(old_content, new_content)
print(diff)
```
@hugoShaka deleted the hugo/chart-split-proxy-auth branch January 13, 2023 14:58
hugoShaka added a commit that referenced this pull request Jan 19, 2023, with the same description as above.
hugoShaka added a commit that referenced this pull request Jan 24, 2023 (…20449), with the same description as above.