cli: TLS certs generated via connect cause one-way connectivity problem #61624
@aaron-crl this is the specific issue that was blocking me last Monday. I can still reproduce it at will. The description at the top pertains to a scenario using two different machines. I can also reproduce it on a single machine by using different port numbers, for example:
Observe the log files: one node can connect to the other, but the other one complains when it tries to connect back:
I can also repro this, and also with the node-join stuff. Might take a look, as Aaron is away at the moment.
Previously, non-trust-leader nodes couldn't connect back to the trust leader due to the presence of the wrong `ca-client.crt` on their disk; the main CA cert/key was being written in four places. This change fixes that bug, and also creates a new `client.node.crt` certificate to prevent other subsequent errors from being thrown. Fixes cockroachdb#61624. Release note: None.
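To make the failure mode concrete, here is a hedged sketch using only Go's standard `crypto/x509` package; it is not code from the fix. It creates two independent CAs, signs a client certificate with common name `node` using one of them (standing in for `client.node.crt` signed by `ca-client.crt`), and checks that certificate for client authentication against each CA. The certificate only verifies against the CA that actually signed it, so a node whose `ca-client.crt` holds a different CA rejects its peer's client certificate, even though the connection in the other direction can still succeed. The helper names (`newCA`, `newClientCert`, `verifiesAgainst`), the ECDSA P-256 keys, and the CN of `node` are assumptions made for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a self-signed CA certificate and its private key (hypothetical helper).
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newClientCert signs a client-auth certificate whose CN is the given user, using the given CA.
func newClientCert(user string, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: user},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifiesAgainst reports whether cert chains to ca for client authentication.
func verifiesAgainst(cert, ca *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err == nil
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	interNodeCA, _, err := newCA("Cockroach CA")
	must(err)
	clientCA, clientCAKey, err := newCA("Cockroach Client CA")
	must(err)
	nodeCert, err := newClientCert("node", clientCA, clientCAKey)
	must(err)
	fmt.Println(verifiesAgainst(nodeCert, clientCA))    // true
	fmt.Println(verifiesAgainst(nodeCert, interNodeCA)) // false: wrong CA in the trust pool
}
```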
…nnect` The end-to-end test for the new `connect` command was incomplete, because of issue cockroachdb#61624 that was blocking the functionality. Now that cockroachdb#63589 is in, we can add the missing test. Release note: None
63589: server, security: Fix one-way connectivity with connect cmd r=knz a=itsbilal

Informs #60632.

Previously, non-trust-leader nodes couldn't connect back to the trust leader due to the presence of the wrong `ca-client.crt` on their disk; the main CA cert/key was being written in four places. This change fixes that bug, and also creates a new `client.node.crt` certificate to prevent other subsequent errors from being thrown.

Fixes #61624.

Release note: None.

63672: kvserver: fix write below closedts bug r=andreimatei a=andreimatei

This patch fixes a bug in our closed timestamp management. This bug was making it possible for a command to close a timestamp even though other requests writing at lower timestamps are currently evaluating.

The problem was that we were assuming that, if a replica is proposing a new lease, there can't be any requests in flight and every future write evaluated on the range will wait for the new lease and then evaluate above the lease start time. Based on that reasoning, the proposal buffer was recording the lease start time as its assignedClosedTimestamp. This matched what it does for every write, where assignedClosedTimestamp corresponds to the closed timestamp carried by the command.

It turns out that the replica's reasoning was wrong. It is, in fact, possible for writes to be evaluating on the range when the lease acquisition is proposed, and these evaluations might be done at timestamps below the would-be lease's start time. This happens when the replica has already received a lease through a lease transfer. The transfer must have applied after the previous lease expired and the replica decided to start acquiring a new one.

This fixes one of the assertion failures seen in #62655.

Release note (bug fix): A bug leading to crashes with the message "writing below closed ts" has been fixed.
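The invariant behind 63672 can be stated as a small check, sketched here in Go under heavy simplification: timestamps are plain integers instead of HLC timestamps, and `inFlightWrites` and `safeToClose` are hypothetical names, not the proposal buffer's API. Closing a timestamp is only safe when every write still evaluating on the range sits strictly above it. The bug amounted to closing the new lease's start time without such a check, which breaks when the replica already holds a transferred lease and has writes evaluating below that start time.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp (hypothetical).
type Timestamp int64

// inFlightWrites tracks the timestamps of writes still evaluating on a range.
type inFlightWrites struct {
	byID map[int]Timestamp
}

func newInFlightWrites() *inFlightWrites {
	return &inFlightWrites{byID: make(map[int]Timestamp)}
}

// begin records a write that has started evaluating at ts.
func (w *inFlightWrites) begin(id int, ts Timestamp) { w.byID[id] = ts }

// end records that the write has finished evaluating.
func (w *inFlightWrites) end(id int) { delete(w.byID, id) }

// safeToClose reports whether ts can be closed: every in-flight write must be
// strictly above ts, otherwise it would land at or below a closed timestamp.
func (w *inFlightWrites) safeToClose(ts Timestamp) bool {
	for _, wts := range w.byID {
		if wts <= ts {
			return false
		}
	}
	return true
}

func main() {
	const leaseStart Timestamp = 100
	writes := newInFlightWrites()
	// A write at ts 90 began evaluating under a lease the replica received via transfer.
	writes.begin(1, 90)
	fmt.Println(writes.safeToClose(leaseStart)) // false: closing 100 would strand the write at 90
	writes.end(1)
	fmt.Println(writes.safeToClose(leaseStart)) // true
}
```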
63756: backupccl: reset restored jobs during cluster restore r=dt a=pbardea

Previously, jobs were restored without modification during cluster restore. Due to a recently discovered bug where backup may miss non-transactional writes written to offline spans by these jobs, their progress may no longer be accurate on the restored cluster.

IMPORT and RESTORE jobs perform non-transactional writes that may be missed. When a cluster RESTORE brings back these OFFLINE tables, it also brings back their associated jobs. To ensure the underlying data in these tables is correct, the jobs are now set to a reverting state so that they can clean up after themselves. Affected in-progress schema change jobs will fail upon validation.

Release note (bug fix): Fix a bug where restored jobs may have assumed they had made progress that was not captured in the backup. The restored jobs are now either canceled cluster restore.

63837: build: update the go version requirement for `make` r=otan a=knz

Fixes #63837.

The builder image already requires go 1.15.10. This patch modifies the check for a non-builder `make` command to require at least the same version.

Release note: None

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
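The go version requirement in 63837 (at least go 1.15.10, matching the builder image) is the kind of check where naive string comparison goes wrong, since `"go1.15.8"` sorts after `"go1.15.10"` lexically. Below is a hedged Go sketch of a component-wise comparison; it is not the check that `make` actually runs, and `parseGoVersion` and `atLeast` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGoVersion turns "go1.15.10" into [1 15 10]. Missing components default
// to 0 ("go1.16" -> [1 16 0]). Pre-release tags such as "go1.16rc1" are rejected.
func parseGoVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "go"), ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("unexpected go version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected go version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version have is >= version want, component by component.
func atLeast(have, want string) (bool, error) {
	h, err := parseGoVersion(have)
	if err != nil {
		return false, err
	}
	w, err := parseGoVersion(want)
	if err != nil {
		return false, err
	}
	for i := range h {
		if h[i] != w[i] {
			return h[i] > w[i], nil
		}
	}
	return true, nil // all components equal
}

func main() {
	const minimum = "go1.15.10" // version required by the builder image
	for _, v := range []string{"go1.15.8", "go1.15.10", "go1.16"} {
		ok, err := atLeast(v, minimum)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s >= %s: %v\n", v, minimum, ok)
	}
}
```

Parsing into integers first is what makes `go1.15.8` correctly compare below `go1.15.10`.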
63846: cli/interactive_tests: complete the end-to-end test for `cockroach connect` r=itsbilal a=knz

The end-to-end test for the new `connect` command was incomplete, because of issue #61624 that was blocking the functionality. Now that #63589 is in, we can add the missing test.

Release note: None

63921: schemaexpr: fix data race in ProcessColumnSet r=adityamaru a=postamar

This commit fixes a data race introduced by my recent changes tracked under #63755, involving the generalized use of catalog.Column instead of descpb.ColumnDescriptor.

Fixes #63907

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
A couple changes in this commit:

- Capitalize CA consistently in protobuf structs and method names
- Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere
- Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other.

Fixes cockroachdb#61624.

Release note: None.
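The "write all CAs/certs from the right places" item can be illustrated with a toy model in Go. This is a hypothetical sketch, not `InitializeFromConfig`: `caBundle` and its fields are invented, a map stands in for the certs directory, and the file set (`ca.crt`, `ca-client.crt`, `ca-ui.crt`) is assumed. The only point it shows is that copying the inter-node CA under every filename leaves `ca-client.crt` holding the wrong CA, whereas writing each file from its own source does not.

```go
package main

import "fmt"

// caBundle is a hypothetical stand-in for the CA material a node receives,
// one CA certificate (PEM bytes) per service.
type caBundle struct {
	InterNodeCA []byte
	ClientCA    []byte
	UICA        []byte
}

// writeCAsBuggy mirrors the shape of the bug: the inter-node CA is written
// under every CA filename, so ca-client.crt ends up holding the wrong CA.
func writeCAsBuggy(b caBundle, files map[string][]byte) {
	for _, name := range []string{"ca.crt", "ca-client.crt", "ca-ui.crt"} {
		files[name] = b.InterNodeCA
	}
}

// writeCAs writes each CA file from its own field of the bundle.
func writeCAs(b caBundle, files map[string][]byte) {
	files["ca.crt"] = b.InterNodeCA
	files["ca-client.crt"] = b.ClientCA
	files["ca-ui.crt"] = b.UICA
}

func main() {
	b := caBundle{
		InterNodeCA: []byte("inter-node CA"),
		ClientCA:    []byte("client CA"),
		UICA:        []byte("UI CA"),
	}
	buggy := map[string][]byte{} // in-memory stand-in for the certs directory
	writeCAsBuggy(b, buggy)
	fixed := map[string][]byte{}
	writeCAs(b, fixed)
	fmt.Printf("buggy ca-client.crt: %s\n", buggy["ca-client.crt"]) // inter-node CA
	fmt.Printf("fixed ca-client.crt: %s\n", fixed["ca-client.crt"]) // client CA
}
```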
This change adds a new CLI command, `connect join`, that lets a new node retrieve CA certificates from an existing secure cluster and bootstrap its own TLS certificates for joining that cluster. This is achieved through the consumption of a join token stored in the join_tokens table. Join tokens can be created using the `create_join_tokens()` SQL builtin function, added in a previous change. The previous `connect` command has now been renamed to `connect init`.

A new set of gRPC endpoints has been created to handle the server side of this change, to support retrieval of CAs as well as entire initialization cert bundles. The cert bundles with CAs and private keys aren't sent over the wire until a node join token has been verified and consumed. The receiving node (the one running `connect join`) then stores them in its SSLCertsDir and bootstraps its own certificates off of them.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code in the `security` package:

- Capitalize CA consistently in protobuf structs and method names
- Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere
- Pass around raw byte certificates in auto_tls_init.go instead of doing excess PEM encodings and decodings.
- Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other.

Fixes cockroachdb#61624.

Release note (cli change): Rename `connect` to `connect init`, and add `connect join` command to retrieve certificates from an existing secure cluster and set up a new node to connect with it.

Co-authored-by: Aaron Blum <aaron@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
63168: ALTER SCHEMA, CREATE SCHEMA, DROP SCHEMA diagrams r=ericharmeling a=ericharmeling

Release justification: non-production code changes

Release note: None

63492: server, security: Token-based add/join TLS functionality r=knz a=itsbilal

Informs #60632.

This change adds a new CLI command, `connect join`, that lets a new node retrieve CA certificates from an existing secure cluster and bootstrap its own TLS certificates for joining that cluster. This is achieved through the consumption of a join token stored in the join_tokens table. Join tokens can be created using the `create_join_tokens()` SQL builtin function, added in a previous change. The previous `connect` command has now been renamed to `connect init`.

A new set of gRPC endpoints has been created to handle the server side of this change, to support retrieval of CAs as well as entire initialization cert bundles. The cert bundles with CAs and private keys aren't sent over the wire until a node join token has been verified and consumed. The receiving node (the one running `connect join`) then stores them in its SSLCertsDir and bootstraps its own certificates off of them.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code in the `security` package:

- Capitalize CA consistently in protobuf structs and method names
- Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere
- Pass around raw byte certificates in auto_tls_init.go instead of doing excess PEM encodings and decodings.
- Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other.

Fixes #61624.

Release note (cli change): Rename `connect` to `connect init`, and add `connect join` command to retrieve certificates from an existing secure cluster and set up a new node to connect with it.

Co-authored-by: Aaron Blum <aaron@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Here is what I used:

On machine 192.168.2.10, I ran the following command:

`./cockroach connect --num-expected-initial-nodes 2 --init-token abc --listen-addr=192.168.2.10 --join=192.168.2.19:26258`

On machine 192.168.2.19, I ran the following:

`./cockroach connect --num-expected-initial-nodes 2 --init-token abc --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081`

This made the connect command complete successfully.
Note the explicit port numbers in `--join` (because of issue #61620) and the explicit IP addresses in `--listen-addr` (because of issues #61619 and #61616).

Then, as recommended by the `connect` command, I ran the following, which worked:

`./cockroach cert create-client root --ca-key=~/.cockroach-certs/ca-client.key`
Then I started my CockroachDB nodes:

On machine 192.168.2.10:

`./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10`

On machine 192.168.2.19:

`./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081`
At this point, the logs start showing something unexpected and undesirable (the first symptom of the problem):

On 192.168.2.10, the logs are fine:

`I210308 15:50:28.243256 206 server/init.go:420 ⋮ [n?] 28 ‹192.168.2.19:26258› is itself waiting for init, will retry`

This indicates that this server is able to establish an outgoing RPC connection to the other one.

On 192.168.2.19, we see the problem:

`W210308 16:05:46.896090 150 server/init.go:422 ⋮ [n?] 41 outgoing join rpc to ‹192.168.2.10:26257› unsuccessful: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›`

This indicates that this server is unable to establish its outgoing RPC connection to the other one.
At this point, I suspected that the `init` RPC might be special and use a different TLS configuration than an already-initialized server. So I ran the following, which worked without errors:

`./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10 --insecure`

`./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081 --insecure`

`./cockroach init --host=... --port=... --insecure`

This initializes the cluster and assigns node IDs without connectivity errors.
Then I restarted the already-initialized servers using the same commands as before, and I see:
Related to #60632
cc @aaron-crl @itsbilal