
Implement optional in-memory proxy cache #3320

Merged
klizhentas merged 1 commit into master from sasha/in-memory on Feb 6, 2020

Conversation

klizhentas
Contributor

This commit resolves #3227

In IOT mode, 10K nodes are connecting back to the proxies, putting
a lot of pressure on the proxy cache.

Before this commit, the Proxy's only cache option was a persistent
sqlite-backed cache. The advantage of that cache is that Proxies
can continue working after reboots even when Auth Servers are unavailable.

The disadvantage is that the sqlite backend breaks down under many
concurrent reads due to performance issues.

This commit introduces a new cache configuration option, 'in-memory':

```yaml
teleport:
  cache:
    # default value is sqlite;
    # the only supported values are sqlite or in-memory
    type: in-memory
```

This cache mode allows two m4.4xlarge proxies to handle 10K connected
IoT-mode nodes with no issues.

The second part of the commit disables the timer-based cache reload, which
caused inconsistent view results when 10K nodes were displayed, with servers
disappearing from the view.
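
As a rough sketch of the idea (not the actual Teleport code), the cache loop stops doing periodic full reloads and only applies incremental events as they arrive from the Auth Server; every name below is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// event stands in for an incremental cache update received from the watcher.
type event struct{ name string }

// watchLoop applies incremental updates as they arrive. The commented-out
// ticker shows the removed behavior: periodically throwing away and reloading
// the whole cache, which made nodes briefly disappear from the view.
func watchLoop(events <-chan event, done <-chan struct{}) {
	// reloadTicker := time.NewTicker(reloadPeriod) // removed: timer-based full reload
	for {
		select {
		case ev := <-events:
			fmt.Println("applying incremental update:", ev.name)
		case <-done:
			return
		}
	}
}

func main() {
	events := make(chan event, 1)
	done := make(chan struct{})
	go watchLoop(events, done)
	events <- event{name: "node heartbeat"}
	time.Sleep(10 * time.Millisecond)
	close(done)
}
```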

The third part of the commit increases the buffer size of the channels
carrying discovery requests by 10x. With 10K nodes the channels were
overflowing and nodes were being disconnected. The logic no longer treats
channel overflow as a reason to close the connection. This is possible due to
changes in the discovery protocol that allow target nodes to handle missing
entries, duplicate entries, or conflicting values.
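
For illustration, here is a minimal Go sketch of the non-blocking send described above; the names `newProxiesC` and `discoveryRequest` and the buffer size are assumptions for the example, not the actual Teleport identifiers:

```go
package main

import "log"

// discoveryRequest stands in for the real discovery message type.
type discoveryRequest struct {
	Proxies []string
}

// sendDiscovery queues a discovery request without blocking. If the buffered
// channel is full, the request is dropped with a warning instead of closing
// the connection, since the discovery protocol now tolerates missing, stale,
// or conflicting updates on the receiving node.
func sendDiscovery(newProxiesC chan discoveryRequest, req discoveryRequest) {
	select {
	case newProxiesC <- req:
	default:
		log.Printf("WARN: discovery channel overflow at %v", len(newProxiesC))
	}
}

func main() {
	// Buffer increased 10x per the commit description; the exact size here
	// is an assumption for the example.
	newProxiesC := make(chan discoveryRequest, 100)
	sendDiscovery(newProxiesC, discoveryRequest{Proxies: []string{"proxy-1"}})
}
```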

@benarent
Contributor

benarent commented Feb 2, 2020

In HA mode, is this cache set for the Auth node, the Proxy node, or both? Depending on what it controls, we might want to put the setting under the specific yaml section.

Should we provide guidelines on how the in-memory cache works in AWS? If a customer is using DynamoDB, do they also need to set the cache for Teleport to scale?

Lastly, should we output diagnostic information to /metrics?

Also, while clearing out my e-mail, I wonder if this will also help issue #2870 (comment).

@webvictim changed the title from "Sasha/in memory" to "Implement optional in-memory proxy cache" on Feb 3, 2020
Contributor

@russjones left a comment


@benarent We should document that in this mode proxies will initialize their cache on boot. This means you trade availability (if proxies are rebooted during an outage of Auth Servers, they won't be able to start) for performance (can scale to a larger number of nodes).

```diff
-	return trace.ConnectionProblem(nil, "discovery channel overflow at %v", len(c.newProxiesC))
+	// Missing proxies update is no longer critical with more permissive
+	// discovery protocol that tolerates conflicting, stale or missing updates
+	c.log.Warnf("discovery channel overflow at %v", len(c.newProxiesC))
```
Contributor


Capitalization and punctuation.

@klizhentas
Contributor Author

@fspmarshall ping

@klizhentas
Contributor Author

@benarent

Auth Servers always use an in-memory cache; they do not persist the CA's private key material to disk.

In HA mode this affects Proxies and Nodes. With this cache, as @russjones noted, Proxies will not be able to tolerate an Auth Server outage after a Proxy restart, because the cache data will be lost. Right now, by default, Proxy servers tolerate an Auth Server outage even if the Proxies reboot.

Contributor

@fspmarshall left a comment


Looks good to me. Note that this PR includes an older version of the changes from #3305, which could create a minor merge conflict. It might be best to either remove those changes here, or port the new state of #3305 and close that PR.

@klizhentas force-pushed the sasha/in-memory branch 2 times, most recently from 41413c8 to 6be9331 on February 5, 2020 16:28
@klizhentas
Contributor Author

retest this please

@klizhentas
Contributor Author

retest this please

14 similar comments
@klizhentas merged commit a22f7be into master on Feb 6, 2020
@klizhentas deleted the sasha/in-memory branch on March 15, 2021 16:50
Successfully merging this pull request may close these issues:

Ability to Scale a Teleport Cluster to support 10k IoT nodes.