Deploy split proxy/auth with helm chart #18857

Merged · 15 commits · Jan 11, 2023

Conversation

@hugoShaka (Contributor) commented Nov 29, 2022

This PR implements big chunks of #18274

The following changes are implemented in the PR:

  • split responsibilities between two pod sets: auth pods and proxy pods
  • use the kubernetes joinMethod to join proxies
  • have a configuration template per mode: aws, gcp, standalone, scratch
  • scale proxies by default when they are replicable (cert-manager or a user-provided cert through a secret; in the future we might also consider pods running behind an ingress replicable)
  • merge user-provided configuration with the generated configuration (see the values sketch after this list)
  • fix initContainer logic (the field was breaking when more than one initContainer was provided)
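
As a rough illustration of the split and the configuration merge, here is a minimal, hypothetical values sketch. The value names (`chartMode`, `highAvailability.certManager`, `auth.teleportConfig`, `proxy.teleportConfig`) are assumptions based on the description above, not the chart's authoritative schema — check the chart reference for the real one.

```yaml
# Hypothetical values.yaml sketch -- value names are assumptions.
clusterName: teleport.example.com
chartMode: standalone              # one of: aws, gcp, standalone, scratch

# Proxies scale by default when their certificate is replicable,
# e.g. issued by cert-manager or provided through a pre-existing secret.
highAvailability:
  replicaCount: 2
  certManager:
    enabled: true
    issuerName: letsencrypt-production

# User-provided snippets deep-merged into the generated teleport.yaml,
# one override per pod set.
auth:
  teleportConfig:
    teleport:
      log:
        severity: DEBUG
proxy:
  teleportConfig:
    teleport:
      log:
        severity: DEBUG
```

With this shape, the chart can render one configuration per pod set and merge the user snippet on top of the generated defaults.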

The following changes have been implemented and reviewed in sub-PRs:

The following changes will be implemented later, in subsequent PRs:

The following changes will not be implemented and the RFD will be modified to reflect this change:

  • using the operator to create proxy join tokens
  • adding a field to deploy CRs alongside the deployment

I am not happy with the state of the tests, but the PR is already too large. I would like to revamp the tests in a subsequent PR.

Before merging I want to run the following tests:

  • Follow local lab getting started with old chart and upgrade to split chart
  • Follow local lab getting started with split chart
  • Follow GKE getting started with old chart and upgrade to split chart
  • Follow GKE getting started with split chart
  • Follow EKS getting started with old chart and upgrade to split chart
  • Follow EKS getting started with split chart (broken, blocked by Fix Kubernetes version detection on EKS #19188)
  • Load-test the chart before and after this PR to estimate the performance impact

Should address:

@hugoShaka added the helm label Nov 29, 2022
@hugoShaka force-pushed the hugo/chart-split-proxy-auth branch 7 times, most recently from 3c762c9 to 6a00b79 on December 7, 2022 14:05
@hugoShaka force-pushed the hugo/chart-split-proxy-auth branch 2 times, most recently from 958d896 to fb912cc on December 7, 2022 15:19
@hugoShaka marked this pull request as ready for review December 7, 2022 15:55
@hugoShaka (Contributor, Author) commented Dec 8, 2022

Upgrading from a past chart version caused ~3 min of downtime, mainly spent on:

  • auth startup time: 45s
  • proxy connect retry: 30s
  • proxy readiness: 30s
  • load balancing pool update: 30s

I'm not sure how to minimize this. One could deploy a second release next to the original one, then change the DNS record to point to the new load balancer. The biggest issue would be reverse-tunnel clients trying to connect to an unreachable proxy, but this might be acceptable.

@webvictim (Contributor) left a comment

Looks really good overall. I think the new templating style is going to give people a lot more flexibility.

We might want to update the README a bit to remove the values listing and point people to the comprehensive chart reference too... which unfortunately is also going to need overhauling as part of the changes 😢

Resolved review threads on:
  • examples/chart/teleport-cluster/templates/auth/pvc.yaml
  • examples/chart/teleport-cluster/values.yaml (three threads)
  • examples/chart/teleport-cluster/templates/psp.yaml
@tigrato (Contributor) left a comment

Looks good.

Part of [RFD-0096](#18274)

This PR adds helm hooks that deploy a configuration test job running `teleport configure --test` to validate that the generated `teleport.yaml` configuration is sane.
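
For readers unfamiliar with Helm test hooks, a sketch of what such a validation job could look like. The resource names, image tag, mount paths, and the config-file argument are hypothetical; only the standard `helm.sh/hook` annotation and the `teleport configure --test` command come from the description above.

```yaml
# Hypothetical test-hook Job; names, image, and paths are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: teleport-cluster-config-check   # hypothetical name
  annotations:
    "helm.sh/hook": test                # run via `helm test`
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: config-check
          image: public.ecr.aws/gravitational/teleport:11  # illustrative tag
          # Validate the rendered configuration without starting Teleport.
          command: ["teleport", "configure", "--test", "/etc/teleport/teleport.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/teleport
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: teleport-cluster-auth  # hypothetical ConfigMap name
```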
@hugoShaka (Contributor, Author) commented:
Documentation PR is here: #19881

I'll wait until there are no more comments on this one, then rebase and send it to the docs team for review.

```yaml
# TLS multiplexing is not supported when using ACM + NLB for TLS termination.
#
# Possible values are 'separate' and 'multiplex'.
proxyListenerMode: "separate"
```
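
For context on the thread below: this chart value maps to Teleport's TLS routing setting. A rough sketch of the corresponding field in the generated teleport.yaml, assuming Teleport config v2; treat it as illustrative rather than the chart's exact output.

```yaml
# Illustrative mapping to the generated teleport.yaml (config v2).
version: v2
auth_service:
  # "multiplex" serves all protocols over the single web port (TLS routing);
  # "separate" keeps a dedicated listener per protocol, which is what
  # ACM+NLB-terminated setups rely on.
  proxy_listener_mode: separate
```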
A contributor commented:
Shouldn't the default value be multiplex nowadays?

@webvictim (Contributor) replied on Jan 9, 2023:

I think more people are terminating TLS in front of Teleport than not when deploying in Kubernetes. The use of ACM is widespread, and the lack of Ingress support is a huge gripe people have with the Helm charts in general. As such, I think it's best to stay explicit about requiring separate ports/listeners until Teleport can work reliably behind TLS termination. If we default to multiplex here, we're just making a rod for our own backs.

@hugoShaka (Contributor, Author) replied:

If/when we manage to have Teleport run behind an Ingress, we'll be able to multiplex by default on those setups. Based on support requests, a lot of people are using ACM-based setups, and switching the default to multiplex would break them.

@hugoShaka (Contributor, Author) replied:

Related issue: #19975

@webvictim (Contributor) left a comment

🎉

@hugoShaka (Contributor, Author) commented:
Following discussions, I searched GitHub issues and learned that Teleport had several issues with stale proxy/auth nodes in the past. I'll run a couple more tests to observe how the cluster reacts during rollouts and validate that the topology change does not introduce a regression or make things worse.

@hugoShaka (Contributor, Author) commented Jan 10, 2023

With split auth/proxy

The auth rollout took ~5 min for the metrics to go back to normal.

The proxy rollout happened really fast, and all nodes reconnected quickly. However, CPU usage increased dramatically after the rollout and did not go back to regular levels (for proxies, nodes, and to a lesser extent auth).

According to the logs, the nodes kept discovering 3 or 4 proxies for 30 minutes. Once the nodes stopped discovering stale proxies, CPU usage went back to nominal levels.

With auth and proxy bundled

The proxy rollout issue also happens in bundled mode.

Split impact

We were already facing issue #20057 in bundled mode. This is non-blocking.

Splitting auth and proxies causes auth rollouts to disconnect every node, and nodes can take up to 5 minutes to reconnect. Most of the wait looks like a bug though: #8793

While the auth rollout impact is not ideal, it does not seem to be a blocker. We might want to investigate why reconnection is delayed so that the update experience becomes smoother.

@zmb3 (Collaborator) left a comment


@hugoShaka enabled auto-merge (squash) January 11, 2023 17:46
@hugoShaka merged commit 4ca4b54 into master Jan 11, 2023
hugoShaka added a commit that referenced this pull request Jan 12, 2023
This commit implements arbitrary configuration passing to Teleport, like what was done for the `teleport-cluster` chart in #18857. This allows users to deploy services or set fields the chart does not support.

The huge snapshot diffs are caused by key-order changes in the config (the YAML export orders keys alphabetically). I validated that the old and new snapshots were strictly equivalent with the following Python snippet:

```python
import yaml
from pathlib import Path
import deepdiff

old = yaml.safe_load(Path("./config-snapshot.old").open())
new = yaml.safe_load(Path("./config-snapshot.new").open())

# Each snapshot entry wraps a rendered manifest: unwrap the stored YAML
# string, then parse the embedded teleport.yaml from the ConfigMap data.
old_content = {k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"]) for (k, v) in old.items()}
new_content = {k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"]) for (k, v) in new.items()}

# An empty diff confirms the old and new snapshots are semantically identical.
diff = deepdiff.DeepDiff(old_content, new_content)
print(diff)
```
@hugoShaka deleted the hugo/chart-split-proxy-auth branch January 13, 2023 14:58
hugoShaka added a commit that referenced this pull request Jan 19, 2023, with the same description as above.
hugoShaka added a commit that referenced this pull request Jan 24, 2023 (…20449), with the same description as above.