Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a functional prototype #2

Closed
8 tasks done
klizhentas opened this issue Mar 3, 2015 · 0 comments · Fixed by #5
Closed
8 tasks done

Implement a functional prototype #2

klizhentas opened this issue Mar 3, 2015 · 0 comments · Fixed by #5

Comments

@klizhentas
Copy link
Contributor

The prototype should have the following capabilities:

  • Should support Etcd as a configuration backend
  • Should support CA public key authorization
  • Authority API and cli for generating and signing public keys
  • Tunneling SSH through SSH using Agent forwarding
  • Structured logs and tracing of all the session activity
  • Multiplexing subsystem for fancy command execution on multiple hosts
  • TCP port forwarding through tunnel to any server (mostly care about auth)
  • JS shell to any server (figure out the auth)
@klizhentas klizhentas added this to the Demo-ready prototype milestone Mar 13, 2015
awly pushed a commit that referenced this issue Jul 17, 2020
When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to
(maybe) start port forwarding and a second time to execute `cmd`.
Instead, reuse the first connection to run `cmd` and only fall back to
re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for
tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
awly pushed a commit that referenced this issue Jul 20, 2020
When running `tsh ssh foo@bar cmd` we end up dialing `bar` twice - once to
(maybe) start port forwarding and a second time to execute `cmd`.
Instead, reuse the first connection to run `cmd` and only fall back to
re-dialing if we're matching multiple nodes by label.

This gives ~20-30% speedup for non-interactive commands (useful for
tools like ansible):

```
> hyperfine 'tsh ssh localhost true' '~/src/teleport/build/tsh ssh localhost true'
Benchmark #1: tsh ssh localhost true
  Time (mean ± σ):      65.5 ms ±   5.0 ms    [User: 12.9 ms, System: 6.1 ms]
  Range (min … max):    57.0 ms …  74.2 ms    41 runs

Benchmark #2: ~/src/teleport/build/tsh ssh localhost true
  Time (mean ± σ):      51.7 ms ±   3.2 ms    [User: 9.0 ms, System: 5.0 ms]
  Range (min … max):    48.5 ms …  68.5 ms    57 runs

Summary
  '~/src/teleport/build/tsh ssh localhost true' ran
    1.27 ± 0.12 times faster than 'tsh ssh localhost true'
```
awly pushed a commit that referenced this issue Feb 18, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Feb 22, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Feb 26, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Mar 3, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Mar 5, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Mar 10, 2021
This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing
awly pushed a commit that referenced this issue Mar 10, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this issue Mar 25, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this issue Mar 29, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
awly pushed a commit that referenced this issue Mar 29, 2021
* mfa: per-session MFA certs for SSH and Kubernetes

This is client-side support for requesting single-use certs with an MFA
check.

The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:

```
$ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date'
Benchmark #1: /tmp/tsh-old ssh talos date
  Time (mean ± σ):      49.9 ms ±   1.0 ms    [User: 15.1 ms, System: 7.4 ms]
  Range (min … max):    48.4 ms …  54.1 ms    59 runs

Benchmark #2: /tmp/tsh-new ssh talos date
  Time (mean ± σ):      60.2 ms ±   1.6 ms    [User: 19.1 ms, System: 8.3 ms]
  Range (min … max):    59.0 ms …  69.7 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  '/tmp/tsh-old ssh talos date' ran
    1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date'
```

Another few other internal changes:

- client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
  Previously, it would be nil (e.g. in a web UI handler or when using an
  identity file) which easily causes panics. I added a noLocalKeyStore
  type instead that returns errors from all methods.

- requesting a user cert with a TTL < 1min will now succeed and return a
  1min cert instead of failing

* Capture access approvals on MFA-issued certs

* Address review feedback

* Address review feedback

* mfa: accept unknown nodes during short-term MFA cert creation

An unknown node could be an OpenSSH node set up via
https://goteleport.com/teleport/docs/openssh-teleport/

In this case, we shouldn't prevent the user from connecting.

There's a small risk of authz bypass - an attacker might know a
different name/IP for a registered node which Teleport doesn't know
about. But a Teleport node will still check RBAC and reject the
connection.

* Validate username against unmapped user identity

IssueUserCertsWithMFA is called on the leaf auth server in case of
trusted clusters. Username in the request object will be that of the
original unmapped caller.

* mfa: add IsMFARequired RPC

This RPC is ran before every connection to check whether MFA is
required. If a connection is against the leaf cluster, this request is
forwarded from root to leaf for evaluation.

* Fix integration tests

* Correctly treat "Username" as login name in IsMFARequired

Also, move the logic into auth.Server out of ServerWithRoles.

* Fix TestHA

* Address review feedback
inertial-frame added a commit that referenced this issue Jun 3, 2021
* docs: home section except for admin guide

* docs: adminguide, ttl, oss

* docs: corrected

* docs: requested changes

* docs: update product name for 6.2+

* docs: make ha references uniform

* docs: make ha references uniform

* docs: most of the reference section

* docs: improve bash snippets

* docs: add tiles to api landing

* docs: typo

* docs: bash corrections

* docs: corrections

* docs: update cli-docs
hatched pushed a commit to hatched/teleport-merge that referenced this issue Nov 30, 2022
jentfoo added a commit that referenced this issue Apr 14, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.
jentfoo added a commit that referenced this issue Apr 18, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.
jentfoo added a commit that referenced this issue Apr 25, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
jentfoo added a commit that referenced this issue Apr 27, 2023
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
jentfoo added a commit that referenced this issue May 1, 2023
* Rate limit all unauthenticated HTTP endpoints

This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.

* Add additional `High` Rate Limiting

This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant of DoS risks).
Initially this was motivated for `ping` and `find` due to the concern that these endpoints are used unauthenticated at login, and potential NAT's may result in very high rates from single egress IP's.
In my testing on my laptop, all of these endpoints can easily get 640/req/sec on a single core within a VM.  Setting the maximum of 480 burst and 120 continuous should both ensure that no single source utilizes all the CPU, as well as build in additional safety margins while providing a layer of protection.

* Fix for missing error check
nick-inkeep pushed a commit to nick-inkeep/teleport-docs that referenced this issue Jun 20, 2023
zmb3 added a commit that referenced this issue Jan 6, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
zmb3 added a commit that referenced this issue Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-merge-queue bot pushed a commit that referenced this issue Oct 28, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments
github-actions bot pushed a commit that referenced this issue Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-actions bot pushed a commit that referenced this issue Oct 28, 2024
If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.
github-merge-queue bot pushed a commit that referenced this issue Oct 29, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments

* Fix some LDAP connection bugs

In #36281 we made some improvements to the LDAP reconnect behavior.
These changes considered the case where we had a connection to the
LDAP server but then got disconnected. They did not consider the case
where we never succesfully established a connection at all.
github-merge-queue bot pushed a commit that referenced this issue Nov 18, 2024
* Always attempt desktop discovery, even if LDAP is not ready

If Teleport loses it's connection to the LDAP server, it will
attempt to initiate a new condition when:

1. The user tries to connect to a desktop and Teleport fails
   to obtain the user's SID.
2. The periodic desktop discovery routine attempts to search
   LDAP for desktops.

In some circumstances, #2 never gets the chance to apply, since
discovery is skipped when LDAP is not ready. Additionally, if
LDAP is not ready, then you can't connect to a desktop, so #1
can't happen either, which means Teleport won't connect again
until it is restarted.

* Periodically use the LDAP connection when discovery is not enabled

If LDAP-based discovery is not enabled then we may go long periods
of time without trying to use the LDAP connection, which prevents
us from detecting disconnects (and restoring the connection) in a
timely manner.

When discovery is disabled, perform a read every 5 minutes and
reconnect if we detect a connection problem.

* Address review comments

* Fix some LDAP connection bugs

In #36281 we made some improvements to the LDAP reconnect behavior.
These changes considered the case where we had a connection to the
LDAP server but then got disconnected. They did not consider the case
where we never succesfully established a connection at all.

* Fix typo

---------

Co-authored-by: Gus Luxton <gus@goteleport.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant