Implement a functional prototype #2

klizhentas · 2015-03-03T03:55:51Z

The prototype should have the following capabilities:

* Should support Etcd as a configuration backend
* Should support CA public key authorization
* Authority API and cli for generating and signing public keys
* Tunneling SSH through SSH using Agent forwarding
* Structured logs and tracing of all the session activity
* Multiplexing subsystem for fancy command execution on multiple hosts
* TCP port forwarding through tunnel to any server (mostly care about auth)
* JS shell to any server (figure out the auth)

When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to (maybe) start port forwarding and a second time to execute `cmd`. Instead, reuse the first connection to run `cmd` and only fall back to re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
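The connection-reuse idea described above can be sketched roughly as follows. This is only an illustration, not the actual `tsh` code: the `conn`, `dialer`, and `runCommand` names are hypothetical, and the "connection" is a stub that counts dials so the effect of reuse is visible.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for an established SSH connection to a node.
type conn struct{ addr string }

// run pretends to execute cmd over the existing connection.
func (c *conn) run(cmd string) string { return fmt.Sprintf("%s: ran %q", c.addr, cmd) }

// dialer is a hypothetical dialer that counts how many times it was used.
type dialer struct{ dials int }

func (d *dialer) dial(addr string) (*conn, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	d.dials++
	return &conn{addr: addr}, nil
}

// runCommand runs cmd on every matched node. When exactly one node matched and
// it is the node that first is already connected to (for example because the
// connection was opened for port forwarding), that connection is reused.
// With several label-matched nodes it falls back to dialing each node.
func runCommand(d *dialer, first *conn, nodes []string, cmd string) ([]string, error) {
	if len(nodes) == 1 && first != nil && first.addr == nodes[0] {
		return []string{first.run(cmd)}, nil
	}
	var out []string
	for _, addr := range nodes {
		c, err := d.dial(addr)
		if err != nil {
			return nil, err
		}
		out = append(out, c.run(cmd))
	}
	return out, nil
}

func main() {
	d := &dialer{}
	first, err := d.dial("bar") // first dial, e.g. to set up port forwarding
	if err != nil {
		panic(err)
	}
	out, err := runCommand(d, first, []string{"bar"}, "true")
	if err != nil {
		panic(err)
	}
	fmt.Println(out, "dials:", d.dials)
}
```

In the single-node case the dial counter stays at one, which is the saving the benchmark above measures; the multi-node path keeps the old behavior.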

* mfa: per-session MFA certs for SSH and Kubernetes

  This is client-side support for requesting single-use certs with an MFA check.

  The client doesn't know whether it needs an MFA check when accessing a resource; this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required.

  This is an extra round-trip for every session, which causes a ~20% slowdown in SSH logins:

  ```
  $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
  Benchmark #1: /tmp/tsh-old ssh talos date
    Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
    Range (min … max):    48.4 ms …  54.1 ms    59 runs

  Benchmark #2: /tmp/tsh-new ssh talos date
    Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
    Range (min … max):    59.0 ms …  69.7 ms    50 runs

    Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

  Summary
    '/tmp/tsh-old ssh talos date' ran
      1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
  ```

  A few other internal changes:
  - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file), which easily caused panics. I added a noLocalKeyStore type instead that returns errors from all methods.
  - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

  An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/

  In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection.

* Validate username against unmapped user identity

  IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller.

* mfa: add IsMFARequired RPC

  This RPC is run before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

  Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
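As a rough illustration of the client flow described in these commits, the sketch below always asks for a single-use cert first and falls back to the regular cert when the server answers that MFA is not needed. Everything here is hypothetical scaffolding rather than Teleport's API: `authServer`, `issueSingleUseCert`, `certForSession`, `errNotNeeded`, and `clampTTL` are invented names, and certs are plain strings. `clampTTL` mirrors the "TTL < 1min returns a 1min cert" change.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotNeeded is a hypothetical sentinel standing in for the server's
// NotNeeded response.
var errNotNeeded = errors.New("MFA not needed")

// minCertTTL mirrors the 1-minute floor mentioned in the commit message.
const minCertTTL = time.Minute

// clampTTL raises short TTLs to the minimum instead of rejecting them.
func clampTTL(ttl time.Duration) time.Duration {
	if ttl < minCertTTL {
		return minCertTTL
	}
	return ttl
}

// authServer is a hypothetical server. mfaRequired models the RBAC decision
// that only the server can make for a given target.
type authServer struct{ mfaRequired map[string]bool }

// issueSingleUseCert returns a single-use cert, or errNotNeeded when RBAC
// does not require MFA for the target.
func (s *authServer) issueSingleUseCert(target string) (string, error) {
	if !s.mfaRequired[target] {
		return "", errNotNeeded
	}
	return "single-use-cert:" + target, nil
}

// certForSession always tries the single-use path first (the extra
// round-trip) and falls back to the regular cert on errNotNeeded.
func certForSession(s *authServer, target, regularCert string) (string, error) {
	cert, err := s.issueSingleUseCert(target)
	switch {
	case err == nil:
		return cert, nil
	case errors.Is(err, errNotNeeded):
		return regularCert, nil
	default:
		return "", err
	}
}

func main() {
	s := &authServer{mfaRequired: map[string]bool{"prod": true}}
	for _, target := range []string{"prod", "dev"} {
		cert, err := certForSession(s, target, "regular-cert")
		fmt.Println(target, cert, err)
	}
	fmt.Println("requested 10s, issued", clampTTL(10*time.Second))
}
```

The later `IsMFARequired` RPC in the same list moves the "is MFA needed?" question into its own call; this sketch only covers the earlier try-then-fall-back shape.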

* docs: home section except for admin guide
* docs: adminguide, ttl, oss
* docs: corrected
* docs: requested changes
* docs: update product name for 6.2+
* docs: make ha references uniform
* docs: make ha references uniform
* docs: most of the reference section
* docs: improve bash snippets
* docs: add tiles to api landing
* docs: typo
* docs: bash corrections
* docs: corrections
* docs: update cli-docs

* Rate limit all unauthenticated HTTP endpoints

  This commit extends what was done in #172 and is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

  Rather than audit endpoints and choose which ones should be rate limited, this commit proposes that, for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

  The primary concern with this type of change is that our rate limit could become too aggressive for general use. Two strategies were considered to make sure this does not affect legitimate use:

  1. Adjust the rate limiter so the rate limit becomes endpoint specific. This would avoid the need to consider how activity on one endpoint affects another.
  2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

  This commit chooses option #2. While #1 has advantages, particularly as endpoints and new use cases are added, #2 provides the strictest and safest rate limits.

  Our rate limits were configured to:

      period: 1 min
      avg rate: 10
      burst rate: 20

  In order to build a safety buffer with option #2, those allowed rates were doubled.

  Additionally, the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added. This is particularly useful for the `ping` endpoint, which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.

* Add additional `High` Rate Limiting

  This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant DoS risks). Initially this was motivated by `ping` and `find`, due to the concern that these endpoints are used unauthenticated at login, and potential NATs may result in very high rates from single egress IPs.

  In my testing on my laptop, all of these endpoints can easily get 640 req/sec on a single core within a VM. Setting the maximum of 480 burst and 120 continuous should ensure that no single source utilizes all the CPU and build in additional safety margins, while still providing a layer of protection.

* Fix for missing error check
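To make option #2 concrete, here is a minimal token-bucket sketch, not the limiter Teleport actually uses. It keeps one bucket per client IP shared across all endpoints (which is what makes the cross-endpoint interactions possible) and lets authenticated requests bypass the limit. The `limits`, `limiter`, `allow`, and `at` names are hypothetical, and `doubledLimits` assumes that "doubled" means an average of 20 and a burst of 40 per minute.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limits groups the three settings named in the commit message.
type limits struct {
	period time.Duration
	rate   float64 // average requests allowed per period
	burst  float64 // maximum requests allowed at once
}

// doubledLimits assumes the original 10 avg / 20 burst per minute, doubled.
var doubledLimits = limits{period: time.Minute, rate: 20, burst: 40}

// bucket holds the remaining tokens for one client.
type bucket struct {
	tokens float64
	last   time.Time
}

// limiter keeps one bucket per client IP, shared by every endpoint.
type limiter struct {
	mu      sync.Mutex
	cfg     limits
	buckets map[string]*bucket
}

func newLimiter(cfg limits) *limiter {
	return &limiter{cfg: cfg, buckets: make(map[string]*bucket)}
}

// at builds a timestamp from seconds so the demo does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// allow reports whether a request from clientIP at time now may proceed.
// Authenticated requests bypass the limit entirely.
func (l *limiter) allow(clientIP string, authenticated bool, now time.Time) bool {
	if authenticated {
		return true
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[clientIP]
	if !ok {
		b = &bucket{tokens: l.cfg.burst, last: now}
		l.buckets[clientIP] = b
	}
	// Refill in proportion to elapsed time, capped at the burst size.
	elapsed := now.Sub(b.last).Seconds()
	b.tokens += elapsed * l.cfg.rate / l.cfg.period.Seconds()
	if b.tokens > l.cfg.burst {
		b.tokens = l.cfg.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	l := newLimiter(doubledLimits)
	allowed := 0
	for i := 0; i < 50; i++ {
		if l.allow("203.0.113.7", false, at(0)) {
			allowed++
		}
	}
	fmt.Println("allowed in burst:", allowed)
	fmt.Println("authenticated bypass:", l.allow("203.0.113.7", true, at(0)))
}
```

Option #1 would key the bucket map by client IP and endpoint together; keeping a single key per client is simpler but means heavy use of one endpoint eats into the budget of the others, which is why the limits are raised.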

* Always attempt desktop discovery, even if LDAP is not ready

  If Teleport loses its connection to the LDAP server, it will attempt to initiate a new connection when:

  1. The user tries to connect to a desktop and Teleport fails to obtain the user's SID.
  2. The periodic desktop discovery routine attempts to search LDAP for desktops.

  In some circumstances, #2 never gets the chance to apply, since discovery is skipped when LDAP is not ready. Additionally, if LDAP is not ready, then you can't connect to a desktop, so #1 can't happen either, which means Teleport won't connect again until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

  If LDAP-based discovery is not enabled then we may go long periods of time without trying to use the LDAP connection, which prevents us from detecting disconnects (and restoring the connection) in a timely manner.

  When discovery is disabled, perform a read every 5 minutes and reconnect if we detect a connection problem.

* Address review comments

* Fix some LDAP connection bugs

  In #36281 we made some improvements to the LDAP reconnect behavior. These changes considered the case where we had a connection to the LDAP server but then got disconnected. They did not consider the case where we never successfully established a connection at all.

* Fix typo

---------

Co-authored-by: Gus Luxton <gus@goteleport.com>
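The periodic probe described above might look something like the following sketch. It is a simplified illustration, not Teleport's desktop service code: the `ldapClient` interface, `checkOnce`, `watch`, and the `fakeLDAP` stub are all hypothetical, and the demo uses a short interval where the commit message uses 5 minutes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ldapClient is a hypothetical minimal interface for the LDAP connection.
type ldapClient interface {
	Read() error      // cheap read used as a health probe
	Reconnect() error // establish a fresh connection
}

// checkOnce performs one probe and reconnects if the probe fails. Because it
// relies only on the probe result, it also covers a client that never
// managed to connect in the first place.
func checkOnce(c ldapClient) (reconnected bool, err error) {
	if err := c.Read(); err == nil {
		return false, nil
	}
	return true, c.Reconnect()
}

// watch probes the connection every interval until stop is closed.
func watch(c ldapClient, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if _, err := checkOnce(c); err != nil {
				fmt.Println("LDAP reconnect failed, will retry:", err)
			}
		case <-stop:
			return
		}
	}
}

// fakeLDAP simulates a connection that starts out disconnected.
type fakeLDAP struct {
	connected  bool
	reconnects int
}

func (f *fakeLDAP) Read() error {
	if !f.connected {
		return errors.New("connection closed")
	}
	return nil
}

func (f *fakeLDAP) Reconnect() error {
	f.reconnects++
	f.connected = true
	return nil
}

func main() {
	c := &fakeLDAP{}
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		watch(c, 10*time.Millisecond, stop) // 5 minutes in the commit message
		close(done)
	}()
	time.Sleep(50 * time.Millisecond)
	close(stop)
	<-done // wait for the watcher to exit before reading its state
	fmt.Println("reconnects:", c.reconnects)
}
```

Keeping the probe in `checkOnce` separate from the ticker loop makes the reconnect decision easy to exercise without waiting on real timers.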

klizhentas mentioned this issue Mar 3, 2015

Implement a prototype for a proxying SSH server that implements concepts expressed in readme #1

Merged

klizhentas added this to the Demo-ready prototype milestone Mar 13, 2015

klizhentas mentioned this issue Apr 2, 2015

Implement web proxying, and web SSH console proof of concept #5

Merged

klizhentas closed this as completed in #5 Apr 2, 2015

klizhentas mentioned this issue Jan 20, 2017

Teleport usability issues #717

Closed

one000mph mentioned this issue Oct 20, 2019

Architecture revision #3093

Merged

wadells mentioned this issue Feb 4, 2020

Update Terraform code to support Terraform 0.12.x and AWS provider 2.x #3319

Merged

alex-kovoy mentioned this issue Mar 16, 2021

RFD 19 - Event Fetch API with Pagination #5885

Merged

inertial-frame mentioned this issue Jun 2, 2021

[v6.2] Edit Pass #6942

Merged

zmb3 mentioned this issue Oct 6, 2021

API release automation with go script #8484

Merged

greedy52 mentioned this issue Jan 24, 2022

Fail to connect to RDS instance in ca-central-1 #9887

Closed

hatched pushed a commit to hatched/teleport-merge that referenced this issue Nov 30, 2022

Update README.md (gravitational#2)

0ac0a37

xacrimon added a commit that referenced this issue Mar 1, 2023

add github.com/google/go-attestation/attest to e imports #2

222591a

xacrimon added a commit that referenced this issue Mar 3, 2023

add github.com/google/go-attestation/attest to e imports #2 (#22465)

5ee96ce

r0mant mentioned this issue Apr 13, 2023

Teleport 13 Test Plan #24576

Closed

jentfoo mentioned this issue Apr 14, 2023

Rate limit all unauthenticated HTTP endpoints #24623

Merged

Tener mentioned this issue May 5, 2023

Fix FuzzRPCClientPartialLength test (MSSQL Server) #25658

Merged

nick-inkeep pushed a commit to nick-inkeep/teleport-docs that referenced this issue Jun 20, 2023

css-modules migration (gravitational#2)

8495a8b

tcsc mentioned this issue Jul 24, 2023

User can't recover signed identities on Helm since v13 #29262

Closed

r0mant mentioned this issue Aug 28, 2023

Teleport 14 Test Plan #31122

Closed

greedy52 mentioned this issue Sep 11, 2023

make tsh match against discovered db name #31689

Merged

gecube mentioned this issue Dec 4, 2023

Provide a way to install a teleport on dedicated bare metal node of k8s #35331

Open

jakule mentioned this issue Dec 27, 2023

[Draft] Cross-compilation MVP #36066

Draft

r0mant mentioned this issue Jan 13, 2024

Teleport 15 Test Plan #36663

Closed

strideynet mentioned this issue Jan 24, 2024

Improve test plan instructions for moderated sessions #37167

Merged

greedy52 mentioned this issue Feb 8, 2024

TLS Routing native WebSocket connection upgrade support #36343

Merged

13 tasks

arianvp mentioned this issue Mar 18, 2024

When pam is enabled and systemd-logind is used, processes spawned by teleport are not moved into the user session's scope unit in the cgroup tree #39501

Open

greedy52 mentioned this issue Aug 19, 2024

AWS audit event shown in Web UI is not accurate #45603

Open
