Teleport 4.0.2 fails to trust cluster after upgrade #2870

Closed
hmadison opened this issue Jul 22, 2019 · 7 comments

hmadison commented Jul 22, 2019

What happened:

On Friday, July 19th, we attempted to upgrade our Teleport clusters from 3.2 to 4.0. Following the upgrade guide, we upgraded our main cluster to 4.0 successfully. The 4.0 main cluster and the 3.2 trusted clusters were able to communicate and operate. When we upgraded our trusted clusters to 4.0, we saw the following error messages in the logs:

  • WARN [PROXY:AGE] Unable to continue processesing requests: heartbeat: agent is stopped.
  • DEBU [PROXY:AGE] Missed, connected to [28c2927d-a87a-4e35-9884-2ee5951d7d3f.<master-name> <trusted-hostname>.<master-name> <trusted-hostname> <master-hostname> remote.kube.proxy.teleport.cluster.local] instead of f9c96cf4-a6c1-4d5c-a312-80c0ad5098e1. target:<master-hostname>:3024 reversetunnel/agent.go:419

The trusted clusters then became unstable: they were only intermittently visible and accessible from the master cluster. During this time, all of their nodes were still listed as reachable (via tctl nodes ls); the only thing that appeared broken was the inter-cluster connectivity.

What you expected to happen:

All 4.0 clusters communicating with other 4.0 clusters in a trusted cluster relationship.

How to reproduce it (as minimally and precisely as possible):

  • Set up two 3.2 enterprise clusters using the HA guidelines and establish trust.
    • AWS NLB as the load balancer
    • DynamoDB/S3 as the storage backend
  • Upgrade the master cluster to 4.0 enterprise.
  • Upgrade the trusted cluster to 4.0.
  • Observe how the trusted cluster fails to maintain a stable connection to the master (see the sketch after this list).
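A rough way to watch the instability from the master side, assuming tsh is logged into the master cluster (cluster names and output below are illustrative, not from our environment):

$ tsh clusters
Cluster Name       Status
------------------ -------
<master-name>      online
<trusted-name>     offline

Re-running the command shows the trusted cluster flapping between online and offline, even though the nodes inside it remain reachable.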

Environment:

  • Teleport version (use teleport version): Teleport v4.0.2 git:v4.0.2-0-gb7e0e872 go1.12.1
  • Tsh version (use tsh version): Teleport v4.0.2 git: go1.12.6
  • OS (e.g. from /etc/os-release): Linux <redacted> 4.14.62-70.117.amzn2.x86_64 #1 SMP Fri Aug 10 20:14:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Relevant Debug Logs If Applicable

@hmadison changed the title from "Teleport 4.0.2 fails to trust cluster after" to "Teleport 4.0.2 fails to trust cluster after upgrade" on Jul 22, 2019
@benarent added the support-request label and removed the bug label on Jul 22, 2019
@benarent (Contributor)

Hi @hmadison,

This is related to #2845, which comes down to how NLBs handle idle timeouts. We plan to fix this in 4.1, but for now you can add the following to teleport.yaml on the auth server:

  keep_alive_interval: "5s"
  keep_alive_count_max: 3
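
For reference, a minimal sketch of where these keys would sit in teleport.yaml, assuming the standard layout in which keep-alive settings live under auth_service (restart the auth service after changing it):

  # teleport.yaml on the auth server (illustrative excerpt)
  auth_service:
    enabled: yes
    keep_alive_interval: "5s"   # send a keep-alive over the reverse tunnel every 5 seconds
    keep_alive_count_max: 3     # tear down the connection after 3 missed keep-alives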

@hmadison (Contributor, Author)

Hey @benarent

I just had a window in which I was able to try this upgrade again. I set the two options in our auth server config and the proxy servers started to run out of file descriptors.

Jul 29 14:51:12 <proxy-hostname> teleport[25204]: DEBU [PROXY:SER] Ping <- 54.80.48.223:33934. latency:334.663µs cluster:<trusted cluster> reversetunnel/remotesite.go:365
Jul 29 14:51:12 <proxy-hostname> teleport[25204]: ERRO [AUTH]      Failed to dial auth server <auth server nlb>:3025: dial tcp: lookup <auth server nlb> on 172.31.0.2:53: dial udp 172.31.0.2:53: socket: too many open files. auth/clt.go:144

For reference:

$ cat /proc/sys/fs/file-max
393133

@benarent (Contributor)

Umm, that doesn't look good. @russjones should be able to provide a bit more insight into this.

@webvictim (Contributor)

If you have several clusters and many nodes, it's not uncommon to run out of file descriptors when the limit is left at its default.

The limit that actually applies to the Teleport process is probably quite different from the system-wide value in /proc/sys/fs/file-max - what does ulimit -n say?

$ cat /proc/sys/fs/file-max
3271074
$ ulimit -n
1024

If you're running Teleport using systemd, you can edit your systemd unit file (usually /lib/systemd/system/teleport.service), add LimitNOFILE=65535 (or a similarly high number) to the [Service] section, and then restart Teleport, for example:
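
A sketch of the change plus the steps to apply it (the limit value is just a reasonably high example):

  # /lib/systemd/system/teleport.service
  [Service]
  LimitNOFILE=65535

$ sudo systemctl daemon-reload
$ sudo systemctl restart teleport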

If you're not using systemd, I would probably look to edit something under /etc/security/limits.d.
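
That would be something along these lines, assuming Teleport runs as root (adjust the user if it runs under a dedicated account; note that limits.d is applied by pam_limits, so it affects new login sessions rather than a daemon started directly by init):

  # /etc/security/limits.d/teleport.conf
  root  soft  nofile  65535
  root  hard  nofile  65535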

@hmadison (Contributor, Author)

Hey @webvictim

Thanks for the advice!

LimitNOFILE was set to 4096. I'm still unsure as to why this became an issue going from version 3.2 to 4.0.
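
For reference, one way to confirm what a running Teleport process is actually allowed and using, assuming pidof resolves the right binary (the output shown is illustrative):

$ grep 'open files' /proc/$(pidof teleport)/limits
Max open files            4096                 4096                 files
$ ls /proc/$(pidof teleport)/fd | wc -l
4078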

@webvictim (Contributor)

I'm not sure either - I don't believe we've changed anything that would make that an issue.

The default keepalive interval in Teleport 4.0.3 has been changed to 5 minutes (rather than the original 15 minutes), which is probably a better value to use:

  keep_alive_interval: "300s"
  keep_alive_count_max: 3

sover02 commented Dec 12, 2019

Hey folks, random comment: we ran into this issue today, seeing the socket: too many open files error while onboarding a large number of nodes all at once.

Our ulimits were very high.

It turned out our DynamoDB backend was hitting its provisioned write capacity. After increasing the DynamoDB limits and performing a rolling restart of our auth servers, everything came back to normal.

I'm wondering if the writes are queued on disk on the auth servers when they can't be written immediately to DynamoDB, which might trigger the file-related error messaging. Someone smarter can confirm. A cluster upgrade might have a similar effect to onboarding a large number of nodes at once. It was hard to identify the problem via logs, but the DynamoDB monitoring charts showed us where the bottleneck was.
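
For anyone hitting the same thing, a rough way to check for write throttling from the CLI instead of the console, assuming you know the backend table name and region (WriteThrottleEvents is the CloudWatch metric DynamoDB emits when writes are throttled; the time window below is illustrative):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name WriteThrottleEvents \
    --dimensions Name=TableName,Value=<teleport-backend-table> \
    --start-time 2019-12-12T00:00:00Z \
    --end-time 2019-12-12T23:59:59Z \
    --period 300 \
    --statistics Sum

A non-zero Sum during the incident window points at provisioned write capacity, rather than file descriptors, as the real bottleneck.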

cc @ghcoi2ck
