Teleport 4.0.2 fails to trust cluster after upgrade #2870
Hey @benarent, I just had a window in which I was able to try this upgrade again. I set the two options in our auth server config, and the proxy servers started to run out of file descriptors.
For reference:
$ cat /proc/sys/fs/file-max
393133
Umm, that doesn't look good. @russjones should be able to provide a bit more insight into this.
If you have several clusters and many nodes, it's not uncommon to run out of file descriptors when the limit is left at the default. The actual limit on open file descriptors is probably quite different to the system-wide figure in /proc/sys/fs/file-max shown above.
If you're running Teleport using systemd, you can edit your systemd unit file to raise the file descriptor limit. If you're not using systemd, I would probably look to edit the system-wide limits configuration instead.
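To make that concrete, here is a rough sketch of both suggestions, assuming a systemd-managed install with a unit named teleport.service; the drop-in path and the 65536 limit are illustrative values, not taken from this thread:

```
# Compare the system-wide ceiling with the limit the running teleport process actually has.
cat /proc/sys/fs/file-max                                      # system-wide ceiling
grep 'Max open files' /proc/"$(pgrep -o -x teleport)"/limits   # per-process limit

# Raise the per-process limit with a systemd drop-in, then restart.
sudo mkdir -p /etc/systemd/system/teleport.service.d
sudo tee /etc/systemd/system/teleport.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart teleport
```

On installs without systemd, the equivalent change would be raising the nofile limit for the user Teleport runs as in the system's limits configuration.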
Hey @webvictim, thanks for the advice!
I'm not sure either - I don't believe we've changed anything that would make that an issue. The default keepalives in Teleport 4.0.3 have been changed to 5 minutes (rather than the original 15 minutes), which is probably a better value to use.
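For anyone wanting to tune that themselves, here is a rough sketch of where the keepalive interval would live, assuming it is exposed under auth_service in teleport.yaml; the key names and values below are assumptions rather than something confirmed in this thread:

```
# Illustrative only: the keepalive behaviour discussed above is assumed to be
# tuned on the auth servers, roughly like the commented snippet below.
#
#   auth_service:
#     keep_alive_interval: 5m     # assumed knob matching the 4.0.3 default mentioned above
#     keep_alive_count_max: 3     # assumed number of missed keepalives before a connection is dropped
#
# Restart the auth service afterwards so the new interval takes effect.
sudo systemctl restart teleport
```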
Hey folks, random comment: we ran into this issue today and saw the same file-related errors. Our ulimits were very high. It turned out our DynamoDB backend was hitting its write capacity. After increasing the DynamoDB limits and performing a rolling restart of our auth servers, services came back to normal. I'm wondering if the writes are queued on disk on the auth servers when they can't be written immediately to DynamoDB, which might trigger the file-related error messages. Someone smarter can confirm. A cluster upgrade might have a similar effect to onboarding a large number of nodes at once. It was kind of hard to identify the problem via logs, but the DynamoDB monitoring charts showed us where the bottleneck was. cc @ghcoi2ck
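For anyone chasing the same symptom, a rough sketch of how to check for that bottleneck from the AWS side; the table name is a placeholder and the capacity numbers are arbitrary examples:

```
# Inspect the provisioned throughput on the Teleport backend table
# (replace teleport-backend with your actual DynamoDB table name).
aws dynamodb describe-table --table-name teleport-backend \
  --query 'Table.ProvisionedThroughput'

# Raise write capacity if the table is throttling writes
# (example numbers only; not applicable if the table uses on-demand capacity).
aws dynamodb update-table --table-name teleport-backend \
  --provisioned-throughput ReadCapacityUnits=25,WriteCapacityUnits=100
```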
What happened:
On Friday, July 19th, we attempted to upgrade our Teleport clusters from 3.2 to 4.0. Following the upgrade guide, we upgraded our main cluster to 4.0 successfully. The 4.0 main cluster and the 3.2 trusted clusters were able to communicate and operate. When we upgraded our trusted clusters to 4.0, we saw the following error messages in the logs:
WARN [PROXY:AGE] Unable to continue processesing requests: heartbeat: agent is stopped.
DEBU [PROXY:AGE] Missed, connected to [28c2927d-a87a-4e35-9884-2ee5951d7d3f.<master-name> <trusted-hostname>.<master-name> <trusted-hostname> <master-hostname> remote.kube.proxy.teleport.cluster.local] instead of f9c96cf4-a6c1-4d5c-a312-80c0ad5098e1. target:<master-hostname>:3024 reversetunnel/agent.go:419
The trusted clusters then became unstable. They would be intermittently visible and accessible from the master cluster. During this time, all of their nodes were still listed as reachable (via tctl nodes ls); the only thing that appeared not to work was the inter-cluster connectivity.
What you expected to happen:
All 4.0 clusters communicating with other 4.0 clusters in a trusted cluster relationship.
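As a rough illustration (not from the original report), checks like the following would show whether the trust relationship is healthy from the root cluster's side; the cluster name is a placeholder:

```
# From a client logged into the root cluster: the trusted (leaf) clusters should show as online.
tsh clusters

# Nodes in a trusted cluster should be listable through the root proxy.
tsh ls --cluster=<trusted-cluster-name>

# On the trusted cluster's auth server: inspect the trusted_cluster resource itself.
tctl get trusted_cluster
```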
How to reproduce it (as minimally and precisely as possible):
Environment:
Teleport version (use teleport version): Teleport v4.0.2 git:v4.0.2-0-gb7e0e872 go1.12.1
Tsh version (use tsh version): Teleport v4.0.2 git: go1.12.6
OS: Linux <redacted> 4.14.62-70.117.amzn2.x86_64 #1 SMP Fri Aug 10 20:14:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Relevant Debug Logs If Applicable