Teleport 4.0.2 fails to trust cluster after upgrade #2870

Closed
hmadison opened this issue Jul 22, 2019 · 7 comments

hmadison commented Jul 22, 2019

What happened:

On Friday, July 19th, we attempted to upgrade our Teleport clusters from 3.2 to 4.0. Following the upgrade guide, we upgraded our main cluster to 4.0 successfully. The 4.0 main cluster and the 3.2 trusted clusters were able to communicate and operate. When we upgraded our trusted clusters to 4.0, we saw the following error messages in the logs:

  • WARN [PROXY:AGE] Unable to continue processesing requests: heartbeat: agent is stopped.
  • DEBU [PROXY:AGE] Missed, connected to [28c2927d-a87a-4e35-9884-2ee5951d7d3f.<master-name> <trusted-hostname>.<master-name> <trusted-hostname> <master-hostname> remote.kube.proxy.teleport.cluster.local] instead of f9c96cf4-a6c1-4d5c-a312-80c0ad5098e1. target:<master-hostname>:3024 reversetunnel/agent.go:419

The trusted clusters then became unstable: they were only intermittently visible and accessible from the master cluster. During this time, all of their nodes were still listed as reachable (via tctl nodes ls); the only thing that appeared broken was the inter-cluster connectivity.

What you expected to happen:

All 4.0 clusters communicating with other 4.0 clusters in a trusted cluster relationship.

How to reproduce it (as minimally and precisely as possible):

  • Set up two 3.2 enterprise clusters using the HA guidelines and establish trust.
    • AWS NLB as the load balancer
    • DynamoDB/S3 as the storage backend
  • Upgrade the master cluster to 4.0 enterprise.
  • Upgrade the trusted cluster to 4.0.
  • Observe how the trusted cluster fails to maintain a stable connection to the master (see the sketch after this list).
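A rough way to watch the instability from the master side, assuming tsh is logged into the master cluster (cluster names and output below are illustrative, not from our environment):

$ tsh clusters
Cluster Name       Status
------------------ -------
<master-name>      online
<trusted-name>     offline

Re-running the command shows the trusted cluster flapping between online and offline, even though the nodes inside it remain reachable.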

Environment:

  • Teleport version (use teleport version): Teleport v4.0.2 git:v4.0.2-0-gb7e0e872 go1.12.1
  • Tsh version (use tsh version): Teleport v4.0.2 git: go1.12.6
  • OS (e.g. from /etc/os-release): Linux <redacted> 4.14.62-70.117.amzn2.x86_64 #1 SMP Fri Aug 10 20:14:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Relevant Debug Logs If Applicable

@hmadison changed the title from "Teleport 4.0.2 fails to trust cluster after" to "Teleport 4.0.2 fails to trust cluster after upgrade" on Jul 22, 2019
@benarent added the support-request label and removed the bug label on Jul 22, 2019
@benarent (Contributor)

Hi @hmadison,

This is related to #2845, which comes down to how NLBs handle idle timeouts. We plan to fix this in 4.1, but for now you can add the following to teleport.yaml on the auth server:

  keep_alive_interval: "5s"
  keep_alive_count_max: 3
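
For reference, a minimal sketch of where these keys would sit in teleport.yaml, assuming the standard layout in which keep-alive settings live under auth_service (restart the auth service after changing it):

  # teleport.yaml on the auth server (illustrative excerpt)
  auth_service:
    enabled: yes
    keep_alive_interval: "5s"   # send a keep-alive over the reverse tunnel every 5 seconds
    keep_alive_count_max: 3     # tear down the connection after 3 missed keep-alives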

@hmadison (Contributor, Author)

Hey @benarent

I just had a window in which I was able to try this upgrade again. I set the two options in our auth server config and the proxy servers started to run out of file descriptors.

Jul 29 14:51:12 <proxy-hostname> teleport[25204]: DEBU [PROXY:SER] Ping <- 54.80.48.223:33934. latency:334.663µs cluster:<trusted cluster> reversetunnel/remotesite.go:365
Jul 29 14:51:12 <proxy-hostname> teleport[25204]: ERRO [AUTH]      Failed to dial auth server <auth server nlb>:3025: dial tcp: lookup <auth server nlb> on 172.31.0.2:53: dial udp 172.31.0.2:53: socket: too many open files. auth/clt.go:144

For reference:

$ cat /proc/sys/fs/file-max
393133

@benarent (Contributor)

Umm, that doesn't look good. @russjones should be able to provide a bit more insight into this.

@webvictim (Contributor)

If you have several clusters and many nodes, it's not uncommon to run out of file descriptors when the limit is left at its default.

The limit that actually applies to the Teleport process is probably quite different from the system-wide value in /proc/sys/fs/file-max - what does ulimit -n say?

$ cat /proc/sys/fs/file-max
3271074
$ ulimit -n
1024

If you're running Teleport using systemd, you can edit your systemd unit file (usually /lib/systemd/system/teleport.service), add LimitNOFILE=65535 (or a similarly high number) to the [Service] section, and then restart Teleport, for example:
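
A sketch of the change plus the steps to apply it (the limit value is just a reasonably high example):

  # /lib/systemd/system/teleport.service
  [Service]
  LimitNOFILE=65535

$ sudo systemctl daemon-reload
$ sudo systemctl restart teleport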

If you're not using systemd, I would probably look to edit something under /etc/security/limits.d.
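
That would be something along these lines, assuming Teleport runs as root (adjust the user if it runs under a dedicated account; note that limits.d is applied by pam_limits, so it affects new login sessions rather than a daemon started directly by init):

  # /etc/security/limits.d/teleport.conf
  root  soft  nofile  65535
  root  hard  nofile  65535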

@hmadison (Contributor, Author)

Hey @webvictim

Thanks for the advice!

LimitNOFILE was set to 4096. I'm still unsure as to why this became an issue going from version 3.2 to 4.0.
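
For reference, one way to confirm what a running Teleport process is actually allowed and using, assuming pidof resolves the right binary (the output shown is illustrative):

$ grep 'open files' /proc/$(pidof teleport)/limits
Max open files            4096                 4096                 files
$ ls /proc/$(pidof teleport)/fd | wc -l
4078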

@webvictim (Contributor)

I'm not sure either - I don't believe we've changed anything that would make that an issue.

The default keepalive interval in Teleport 4.0.3 has been changed to 5 minutes (rather than the original 15 minutes), which is probably a better value to use:

  keep_alive_interval: "300s"
  keep_alive_count_max: 3

sover02 commented Dec 12, 2019

Hey folks, random comment: we ran into this issue today, seeing the socket: too many open files error while onboarding a large number of nodes all at once.

Our ulimits were very high.

It turned out our DynamoDB backend was hitting its provisioned write capacity. After increasing the DynamoDB limits and performing a rolling restart of our auth servers, everything came back to normal.

I'm wondering if the writes are queued on disk on the auth servers when they can't be written immediately to DynamoDB, which might trigger the file-related error messaging. Someone smarter can confirm. A cluster upgrade might have a similar effect to onboarding a large number of nodes at once. It was hard to identify the problem via logs, but the DynamoDB monitoring charts showed us where the bottleneck was.
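
For anyone hitting the same thing, a rough way to check for write throttling from the CLI instead of the console, assuming you know the backend table name and region (WriteThrottleEvents is the CloudWatch metric DynamoDB emits when writes are throttled; the time window below is illustrative):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/DynamoDB \
    --metric-name WriteThrottleEvents \
    --dimensions Name=TableName,Value=<teleport-backend-table> \
    --start-time 2019-12-12T00:00:00Z \
    --end-time 2019-12-12T23:59:59Z \
    --period 300 \
    --statistics Sum

A non-zero Sum during the incident window points at provisioned write capacity, rather than file descriptors, as the real bottleneck.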

cc @ghcoi2ck
