Ability to Scale a Teleport Cluster to support 10k IoT nodes #3227
Comments
@JonGilmore we will benchmark ASAP and get back to you.
@klizhentas happy New Year! Have you been able to perform any benchmarking?
Following up from today's call: along with the scaling question, the team is seeing a lot of these issues in the proxy logs.
@benarent any updates from the Gravitational end on these errors?
@benarent any updates?
Hey Jon, we are still working on it internally. We'll keep you updated in Slack.
Description

These are some benchmark results of Teleport 4.2.0 with Teleport IoT-connected nodes using a managed AWS deployment. Setup:
Some DynamoDB metrics: both auth servers and proxies have the following connection_limits:
Socket limits:
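As a rough illustration of what the socket limits govern here (not part of the original report), the sketch below inspects and raises the per-process open-file limit on Linux, which caps how many node connections a single proxy can hold. In production this is typically set via systemd's LimitNOFILE or /etc/security/limits.conf rather than in code.

```go
// Illustrative sketch, not Teleport code: check and raise RLIMIT_NOFILE so a
// proxy process can hold tens of thousands of concurrent node sockets.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)

	// Raise the soft limit up to the hard limit; each connected node costs
	// at least one file descriptor on the proxy.
	lim.Cur = lim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
}
```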
Results

The system becomes unstable during edge cases (e.g. all nodes restarting). SQLite caches lock down the Proxy during the surge caused by nodes reconnecting:
On a full proxy restart this situation holds for about 2-3 minutes, causing all 16 cores of the proxy to spike and lock up. It is largely caused by the lack of good randomized backoff on reconnects from nodes, as they all surge-connect at once. Eventually the system stabilizes and works OK; however, this raises usability and reliability concerns, since a full restart of the proxies can happen during day-to-day operations.
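For illustration, here is a minimal sketch of the kind of randomized (jittered) exponential backoff that would spread these reconnects out. The constants and the reconnectWithBackoff helper are hypothetical, not Teleport's actual reconnect logic.

```go
// Hypothetical sketch: fully jittered exponential backoff for node reconnects,
// so a fleet of IoT nodes does not hit the proxy all at once after a restart.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 2 * time.Minute
)

// reconnectWithBackoff retries dial until it succeeds, sleeping for a random
// duration drawn from an exponentially growing window between attempts.
func reconnectWithBackoff(dial func() error) {
	delay := baseDelay
	for attempt := 1; ; attempt++ {
		if err := dial(); err == nil {
			return
		}
		// Full jitter: sleep anywhere in [0, delay) so thousands of nodes
		// spread their reconnects over time instead of surging together.
		sleep := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt, sleep)
		time.Sleep(sleep)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	tries := 0
	reconnectWithBackoff(func() error {
		tries++
		if tries < 4 {
			return errors.New("proxy not ready")
		}
		return nil
	})
}
```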
@klizhentas thank you for the reply. Currently, we're scaled to (6) c5.xl proxy nodes and (3) c5.2xl auth nodes and still seeing sporadic behavior (disconnects, not all nodes reporting when we try to run a
This commit resolves #3227.

In IoT mode, 10K nodes connect back to the proxies, putting a lot of pressure on the proxy cache. Before this commit, the Proxy's only cache option was a persistent sqlite-backed cache. The advantage of that cache is that Proxies can continue working after reboots even with Auth servers unavailable. The disadvantage is that the sqlite backend breaks down under many concurrent reads due to performance issues.

This commit introduces a new cache configuration option, 'in-memory':

```yaml
teleport:
  cache:
    # default value sqlite,
    # the only supported values are sqlite or in-memory
    type: in-memory
```

This cache mode allows two m4.4xlarge proxies to handle 10K IoT-mode connected nodes with no issues.

The second part of the commit disables the cache reload on a timer, which caused inconsistent view results for 10K displayed nodes, with servers disappearing from the view.

The third part of the commit increases the channels buffering discovery requests 10x. The channels were overfilling with 10K nodes, and nodes were being disconnected. The logic no longer treats channel overflow as a reason to close the connection. This is possible due to changes in the discovery protocol that allow target nodes to handle missing entries, duplicate entries, or conflicting values.
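As a rough sketch of the "do not close on overflow" behavior described in the last paragraph, the snippet below sends discovery requests into a buffered channel with a non-blocking send and drops the request, not the connection, when the buffer is full. The discoveryRequest type, sender struct, and trySend helper are hypothetical names, not Teleport's implementation.

```go
// Hypothetical sketch: buffered discovery-request queue that tolerates
// overflow by dropping requests instead of closing the node's connection.
package main

import "fmt"

// discoveryRequest stands in for a discovery protocol message.
type discoveryRequest struct {
	Proxies []string
}

// sender owns a buffered channel of pending discovery requests.
type sender struct {
	requests chan discoveryRequest
	dropped  int
}

func newSender(buffer int) *sender {
	return &sender{requests: make(chan discoveryRequest, buffer)}
}

// trySend enqueues a request without blocking. On overflow it drops the
// request and returns false; the caller keeps the connection open because the
// protocol tolerates missing, duplicate, or conflicting entries.
func (s *sender) trySend(req discoveryRequest) bool {
	select {
	case s.requests <- req:
		return true
	default:
		s.dropped++
		return false
	}
}

func main() {
	s := newSender(2) // a real buffer would be much larger (e.g. 10x the old size)
	for i := 0; i < 5; i++ {
		s.trySend(discoveryRequest{Proxies: []string{fmt.Sprintf("proxy-%d", i)}})
	}
	fmt.Printf("queued=%d dropped=%d\n", len(s.requests), s.dropped)
}
```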
What happened:
When we launched Teleport 4.0, we added a number of scalability improvements and defined system requirements: https://gravitational.com/teleport/docs/faq/#whats-teleport-scalability-and-hardware-recommendations
We currently only recommend up to 2,000 nodes connecting through a NAT (running in IoT mode). We have a customer who will have around 10,000 nodes.
This ticket is to track the work required to support Teleport at this scale.