Cache per-cluster SSH certificates under ~/.tsh #5938

Merged: andrejtokarcik merged 11 commits into master from andrej/feat/per-cluster-ssh-certs on Mar 29, 2021

@andrejtokarcik andrejtokarcik commented Mar 10, 2021

 ~/.tsh/
 └── keys
    ├── one.example.com            --> Proxy hostname
    │   ├── certs.pem              --> TLS CA certs for the Teleport CA
    │   ├── foo                    --> RSA Private Key for user "foo"
    │   ├── foo.pub                --> Public Key
-   │   ├── foo-cert.pub           --> SSH certificate for proxies and nodes
    │   ├── foo-x509.pem           --> TLS client certificate for Auth Server
+   │   ├── foo-ssh                --> SSH certs for user "foo"
+   │   │   ├── root-cert.pub      --> SSH cert for Teleport cluster "root"
+   │   │   └── leaf-cert.pub      --> SSH cert for Teleport cluster "leaf"

When -J is set, this also loads/reissues the SSH cert for the cluster associated with the jumphost's certificate. Fixes #5637.

Warning: Although I'm pretty confident about the current design, more testing is still needed, especially for scenarios involving MFA, access requests, k8s/DB/app access.

@andrejtokarcik andrejtokarcik changed the title Cache SSH certificates per cluster under ~/.tsh Cache per-cluster SSH certificates under ~/.tsh Mar 10, 2021
@andrejtokarcik andrejtokarcik force-pushed the andrej/feat/per-cluster-ssh-certs branch 10 times, most recently from c3d9dda to 82d3d63 Compare March 11, 2021 19:38
@andrejtokarcik andrejtokarcik marked this pull request as ready for review March 11, 2021 20:32
@andrejtokarcik andrejtokarcik requested a review from awly March 11, 2021 20:32
@andrejtokarcik andrejtokarcik requested a review from Joerger March 16, 2021 13:55
@awly awly left a comment

Looks mostly fine, but we definitely need to test k8s/db access and u2f before merging because there's a ton of refactoring in here.
Ping me after you address the comments and I can help with some of it.

Comment on lines -134 to -161

// read in key for this user in proxy
key, err := a.GetKey()
if err != nil {
	if trace.IsNotFound(err) {
		return a, nil
	}
	return nil, trace.Wrap(err)
}

a.log.Infof("Loading key for %q", username)

// load key into the agent
_, err = a.LoadKey(*key)
if err != nil {
	return nil, trace.Wrap(err)
}

Contributor

why is this removed?

Contributor Author

To fetch an SSH key we now have to provide a cluster name. Here in KeyAgent, however, the cluster isn't known, so this initial key loading can no longer be done this simply. It has to move out to a context that can indicate which of the cluster-specific SSH keys should be loaded.

type withKubeCerts struct {
teleportClusterName string
// WithSSHCerts is a CertOption for handling SSH certificates.
type WithSSHCerts struct{}
Contributor

since these are now stateless, can you make them into vars?

var WithSSHCerts = withSSHCerts{}

type withSSHCerts struct {}

Contributor Author

I'd rather keep the exported CertOption structs for possible field additions in the future. Also, WithDBCerts still has some state (dbName for logging out of a specific DB).
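
The trade-off in this reply can be sketched as follows. The `certOption` interface and its `describe` method are assumptions made up for the example; only the names `WithSSHCerts`, `WithDBCerts` and `dbName` come from the discussion, and the real CertOption interface may look different.

```go
package main

import "fmt"

// certOption is a hypothetical stand-in for the CertOption interface
// discussed above.
type certOption interface {
	describe() string
}

// WithSSHCerts carries no state today. As an exported struct it can gain
// fields later while call sites that write WithSSHCerts{} keep compiling.
type WithSSHCerts struct{}

func (WithSSHCerts) describe() string { return "ssh certs" }

// WithDBCerts still keeps state: the database name to act on.
type WithDBCerts struct {
	dbName string
}

func (o WithDBCerts) describe() string {
	if o.dbName == "" {
		return "all db certs"
	}
	return "db certs for " + o.dbName
}

// describeAll shows that stateless and stateful options are used uniformly.
func describeAll(opts ...certOption) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		out = append(out, o.describe())
	}
	return out
}

func main() {
	fmt.Println(describeAll(WithSSHCerts{}, WithDBCerts{dbName: "postgres"}))
}
```

With the `var WithSSHCerts = withSSHCerts{}` form suggested in the review, callers would pass the variable itself, which is shorter but leaves no natural place for per-call fields.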

@andrejtokarcik andrejtokarcik force-pushed the andrej/feat/per-cluster-ssh-certs branch 2 times, most recently from bd04762 to 858d292 Compare March 23, 2021 00:26
andrejtokarcik commented Mar 23, 2021

Please re-review, I introduced new changes with the rebase.

I had to roll back my idea of requesting certs even for clusters specified only with --cluster. That behaviour would cause a regression: it would completely prevent SSO users from using --cluster because of this error:

User [...] tried to issue a cert for externally managed user, this is not supported.

Update: The error seems to have been fixed so it comes up only in connection with impersonation requests.

awly commented Mar 29, 2021

@andrejtokarcik something doesn't handle missing SSH certs correctly:

$ rm ~/.tsh/keys/localhost/awly-ssh/*
$ tsh ssh -J localhost:3023 talos
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x17ad8c2]

goroutine 44 [running]:
github.com/gravitational/teleport/lib/client.(*ProxyClient).Close(0x0, 0x0, 0x0)
	/home/awly/src/teleport/lib/client/client.go:1103 +0x22
panic(0x1a4fa80, 0x29ea8c0)
	/home/awly/.go/src/runtime/panic.go:965 +0x1b9
github.com/gravitational/teleport/lib/client.(*ProxyClient).localAgent(...)
	/home/awly/src/teleport/lib/client/client.go:1398
github.com/gravitational/teleport/lib/client.(*ProxyClient).reissueUserCerts(0x0, 0x1ed5668, 0xc000214540, 0xc00071219c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/awly/src/teleport/lib/client/client.go:203 +0x6b8
github.com/gravitational/teleport/lib/client.(*ProxyClient).ReissueUserCerts(0x0, 0x1ed5668, 0xc000214540, 0xc00071219c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/awly/src/teleport/lib/client/client.go:186 +0x7d
github.com/gravitational/teleport/lib/client.(*TeleportClient).ReissueUserCerts(0xc00034cf00, 0x1ed5668, 0xc000214540, 0xc00071219c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/awly/src/teleport/lib/client/api.go:1133 +0x165
github.com/gravitational/teleport/lib/client.(*TeleportClient).LoadKeyForClusterWithReissue(0xc00034cf00, 0x1ed5668, 0xc000214540, 0xc00071219c, 0x4, 0x0, 0xc0001a3a78)
	/home/awly/src/teleport/lib/client/api.go:1069 +0x138
github.com/gravitational/teleport/lib/client.(*TeleportClient).loadKeyForClusterFromCert.func1.1(0xc00034cf00, 0x14, 0xc000180a01)
	/home/awly/src/teleport/lib/client/api.go:2010 +0x4e
github.com/gravitational/teleport/lib/client.(*TeleportClient).WithoutJumpHosts(0xc00034cf00, 0xc0001a3c30, 0x1c7802b, 0x14)
	/home/awly/src/teleport/lib/client/api.go:2027 +0x74
github.com/gravitational/teleport/lib/client.(*TeleportClient).loadKeyForClusterFromCert.func1(0x7fff711f0ce3, 0xe, 0x1eaa1d8, 0xc0001d82d0, 0x1ec88b0, 0xc00027a000, 0x0, 0x0)
	/home/awly/src/teleport/lib/client/api.go:2009 +0x1ad
golang.org/x/crypto/ssh.(*handshakeTransport).client(0xc000260000, 0x1ea6fd8, 0x2ab09c0, 0xc00025e300, 0xc000258300, 0x1, 0x0, 0xa9d145)
	/home/awly/src/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:641 +0x19a
golang.org/x/crypto/ssh.(*handshakeTransport).enterKeyExchange(0xc000260000, 0xc000374380, 0x1b7, 0x1b7, 0x2, 0xc000232f01)
	/home/awly/src/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:587 +0x668
golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc000260000)
	/home/awly/src/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:301 +0x1a5
created by golang.org/x/crypto/ssh.newClientTransport
	/home/awly/src/teleport/vendor/golang.org/x/crypto/ssh/handshake.go:135 +0x1a7

#
# I stopped the server here
#
 
$ tsh ssh -J localhost:3023 talos
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x17ad8c2]

goroutine 1 [running]:
github.com/gravitational/teleport/lib/client.(*ProxyClient).Close(0x0, 0x0, 0x0)
	/home/awly/src/teleport/lib/client/client.go:1103 +0x22
panic(0x1a4fa80, 0x29ea8c0)
	/home/awly/.go/src/runtime/panic.go:971 +0x499
github.com/gravitational/teleport/lib/client.(*ProxyClient).GetSites(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/awly/src/teleport/lib/client/client.go:80 +0x66
github.com/gravitational/teleport/lib/client.(*ProxyClient).currentCluster(0x0, 0x1ed5668, 0xc000214940, 0x0)
	/home/awly/src/teleport/lib/client/client.go:1352 +0x2f
github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH(0xc000153900, 0x1ed5668, 0xc000214940, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/awly/src/teleport/lib/client/api.go:1253 +0xe5
main.onSSH.func1(0x0, 0x0)
	/home/awly/src/teleport/tool/tsh/tsh.go:1416 +0x6c
github.com/gravitational/teleport/lib/client.RetryWithRelogin(0x1ed5668, 0xc000214940, 0xc000153900, 0xc00054b5d0, 0x0, 0x1c70c90)
	/home/awly/src/teleport/lib/client/api.go:442 +0x3c
main.onSSH(0xc0000abc00, 0x1c61253, 0x3)
	/home/awly/src/teleport/tool/tsh/tsh.go:1415 +0x105
main.Run(0xc000142010, 0x4, 0x4, 0x0, 0x0, 0x0, 0x40ed25, 0xc00010e058)
	/home/awly/src/teleport/tool/tsh/tsh.go:533 +0x99e7
main.main()
	/home/awly/src/teleport/tool/tsh/tsh.go:256 +0x13d
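
In both traces the first argument of the `(*ProxyClient)` methods is 0x0, meaning they were invoked on a nil receiver. The sketch below illustrates that class of panic and one common guard: check the constructor's error before using or deferring anything on the client. It is an illustration only, with hypothetical names (`proxyClient`, `connect`, `withProxy`), and is not the fix applied in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// proxyClient is a hypothetical, much simplified stand-in for ProxyClient.
type proxyClient struct {
	closed bool
}

// Close writes to a field of its receiver, so calling it on a nil
// *proxyClient panics with a nil pointer dereference.
func (c *proxyClient) Close() error {
	c.closed = true
	return nil
}

// connect simulates a connection attempt; on failure it returns a nil
// client together with an error.
func connect(fail bool) (*proxyClient, error) {
	if fail {
		return nil, errors.New("no SSH certificate for cluster")
	}
	return &proxyClient{}, nil
}

// withProxy checks the error before touching the client, so neither the
// callback nor the deferred Close ever runs on a nil pointer.
func withProxy(fail bool, fn func(*proxyClient) error) error {
	c, err := connect(fail)
	if err != nil {
		return err
	}
	defer c.Close()
	return fn(c)
}

func main() {
	err := withProxy(true, func(c *proxyClient) error { return nil })
	fmt.Println("failed connect:", err)

	err = withProxy(false, func(c *proxyClient) error { return nil })
	fmt.Println("successful connect:", err)
}
```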

awly commented Mar 29, 2021

Looks like the above problem only happens when I delete the root SSH cert.
Deleting only the leaf cert works as expected:

$ rm ~/.tsh/keys/localhost/awly-ssh/leaf-cert.pub
$ tsh ssh -J localhost:4023 talos # port 4023 is the leaf proxy
> # logged in
the connection was closed on the remote side on  29 Mar 21 11:34 PDT

Co-authored-by: Andrew Lytvynov <andrew@goteleport.com>
}
// try to authenticate using every non interactive auth method we have:
var errs []error
for i, m := range tc.authMethods() {
Contributor

Here, tc.authMethods() will return nil if the SSH cert is missing for the cluster.
tc.authMethods (and the underlying LocalKeyAgent.AuthMethods()) should return an error if no SSH cert is present.

awly commented Mar 29, 2021

@klizhentas for docs - when the user upgrades tsh to 6.1, they will need to re-login because the SSH cert is stored in a different location.
We think it's an acceptable price for not leaving compatibility/migration logic in profile loading, but let us know if you think otherwise.

@andrejtokarcik andrejtokarcik force-pushed the andrej/feat/per-cluster-ssh-certs branch from 3e0edc8 to 60795b8 Compare March 29, 2021 19:18
@andrejtokarcik andrejtokarcik force-pushed the andrej/feat/per-cluster-ssh-certs branch from 60795b8 to 914067f Compare March 29, 2021 19:23
	m = append(m, agentMethods...)
}
if len(m) == 0 {
	return nil, trace.BadParameter("no auth method available")
Contributor

Suggested change
- return nil, trace.BadParameter("no auth method available")
+ return nil, trace.NotFound("no SSH auth method available, try logging in again")

Contributor Author

I wouldn't reformulate in this way as it would make the relogin log unreadable. Current version:

DEBU [CLIENT]    Activating relogin on no auth method available. client/api.go:455

Also, RetryWithRelogin wouldn't handle a NotFound error either, because it only retries on the error types checked here:

teleport/lib/client/api.go

Lines 447 to 451 in 879f8c2

// Assume that failed handshake is a result of expired credentials,
// retry the login procedure
if !utils.IsHandshakeFailedError(err) && !utils.IsCertExpiredError(err) && !trace.IsBadParameter(err) && !trace.IsTrustError(err) {
	return trace.Wrap(err)
}

Contributor

OK, as long as this error is never returned to users in non-debug output

Contributor Author

Or maybe it would actually help handle the errors more correctly when the user has to re-login manually...

  for i := range signers {
  	// filter out non-certificates (like regular public SSH keys stored in the SSH agent):
  	_, ok := signers[i].PublicKey().(*ssh.Certificate)
  	if ok {
  		m = append(m, sshutils.NewAuthMethodForCert(signers[i]))
  	}
  }
- return m
+ if len(m) == 0 {
+ 	return nil, trace.BadParameter("no auth method available")
Contributor

Suggested change
- return nil, trace.BadParameter("no auth method available")
+ return nil, trace.NotFound("no SSH auth method available, try logging in again")

awly commented Mar 29, 2021

LGTM, there are still a few edge cases with corrupt ~/.tsh contents, but we can make the code more robust later.
Everything works well, as long as the user doesn't go messing with ~/.tsh

@andrejtokarcik andrejtokarcik enabled auto-merge (squash) March 29, 2021 21:02
@andrejtokarcik andrejtokarcik merged commit 52dfeec into master Mar 29, 2021
@andrejtokarcik andrejtokarcik deleted the andrej/feat/per-cluster-ssh-certs branch March 29, 2021 21:14
@Joerger Joerger mentioned this pull request Apr 7, 2021
Development

Successfully merging this pull request may close these issues.

Implement RFD 19: Cluster Routing
4 participants