cli: TLS certs generated via `connect` cause one-way connectivity problem #61624

knz · 2021-03-08T16:10:24Z

Here is what I used:

on machine 192.168.2.10 I ran the following command:

./cockroach connect --num-expected-initial-nodes 2 --init-token abc --listen-addr=192.168.2.10 --join=192.168.2.19:26258
on machine 192.168.2.19 I ran the following:

./cockroach connect --num-expected-initial-nodes 2 --init-token abc --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081

This made the connect command complete successfully.

Note: beware of mentioning the port number in --join (because of issue #61620) and explicit IP addresses in --listen-addr (because of issues #61619 and #61616)

Then as recommended by the connect command I ran the following, which worked:

./cockroach cert create-client root --ca-key=~/.cockroach-certs/ca-client.key

Then I started my CockroachDB nodes:

on machine 192.168.2.10:

./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10
on machine 192.168.2.19:

./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081
Here I start observing in logs something that is unexpected / undesirable (first symptom of the problem):
- on 192.168.2.10 logs are fine:
  
  I210308 15:50:28.243256 206 server/init.go:420 ⋮ [n?] 28 ‹192.168.2.19:26258› is itself waiting for init, will retry
  
  This indicates that this server is able to establish an outgoing RPC conn to the other one.
- on 192.168.2.19, we see the problem:
  
  W210308 16:05:46.896090 150 server/init.go:422 ⋮ [n?] 41 outgoing join rpc to ‹192.168.2.10:26257› unsuccessful: ‹rpc error: code = Unauthenticated desc = TLSInfo is not a vailable in request context›
  
  This indicates that this server is unable to establish its outgoing RPC conn to the other one.
at this point I was suspecting that maybe the init RPC is special and uses a different TLS configuration that an already-initialized server. So I ran the following, which worked without errors:
- on 192.168.2.10: ./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10 --insecure
- on 192.168.2.19: ./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081 --insecure
- anywhere: ./cockroach init --host=... --port=... --insecure
This initializes the cluster and assigns node ID without connectivity errors.
Then I re-start the already-initialized servers, using the same commands as previously. Then ISee:
- on 192.168.2.10 I see the following errors:

E210308 16:09:11.689777 446 kv/kvserver/consistency_queue.go:191 ⋮ [n1,consistencyChecker,s1,r22/1:‹/Table/2{6-7}›] 123  computing own checksum: could not dial node ID 1: u
nable to dial n1: ‹breaker open›
E210308 16:09:11.689856 446 kv/kvserver/queue.go:1093 ⋮ [n1,consistencyChecker,s1,r22/1:‹/Table/2{6-7}›] 124  computing own checksum: could not dial node ID 1: unable to di
al n1: ‹breaker open›
E210308 16:09:13.688958 1331 kv/kvserver/consistency_queue.go:191 ⋮ [n1,consistencyChecker,s1,r7/1:‹/Table/1{1-2}›] 125  computing own checksum: could not dial node ID 1: f
ailed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context
›

on 192.168.2.19 I see the following errors:

W210308 16:08:49.464792 132 vendor/google.golang.org/grpc/internal/channelz/logging.go:73 ⋮ [-] 55  ‹grpc: addrConn.createTransport failed to connect to {192.168.2.10:26257
  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...›
I210308 16:08:51.678239 257 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 56  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› tripped: failed to
 connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.678323 257 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 57  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› event: ‹BreakerTri
pped›
I210308 16:08:51.678351 257 2@rpc/nodedialer/nodedialer.go:160 ⋮ [ct-client] 58  unable to connect to n1: failed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connecti
on heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.766654 76 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 59  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› tripped: failed to
connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.766737 76 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 60  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› event: ‹BreakerTrip
ped›
I210308 16:08:51.766759 76 2@rpc/nodedialer/nodedialer.go:160 ⋮ [n2,ts-poll,range-lookup=‹/Meta2/System/tsd/cr.node.build.timestamp/2/10s/2021-03-08T16:00:00Z›] 61  unable
to connect to n1: failed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available
 in request context›

Related to #60632

cc @aaron-crl @itsbilal

The text was updated successfully, but these errors were encountered:

knz · 2021-03-08T16:18:10Z

@aaron-crl this is the specific issue that was blocking me last Monday. I can still reproduce this at will.

The description at top pertains a scenario using 2 different machines.

I can also reproduce this by using a single machine but different port numbers, for example:

mkdir -p n1 n2

(cd n2 && ../cockroach connect --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26258 --listen-addr=192.168.2.19:26259  --http-addr=:8082 --certs-dir=certs)
# in another terminal
(cd n1 && ../cockroach connect --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26259 --listen-addr=192.168.2.19:26258  --http-addr=:8081 --certs-dir=certs

# the connect command succeeds; now run the servers:

(cd n2 && ../cockroach start --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26259 --listen-addr=192.168.2.19:26258  --http-addr=:8081 --certs-dir=certs)
# in another terminal
(cd n1 && ../cockroach start --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26258 --listen-addr=192.168.2.19:26259  --http-addr=:8082 --certs-dir=certs)

Observe the log files: one node can connect to the other, but the other one complains when it tries to connect back:

outgoing join rpc to ‹192.168.2.19:26259› unsuccessful: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›

itsbilal · 2021-04-08T21:21:14Z

I can also repro this, and also with the node-join stuff. Might take a look as Aaron is away at the moment.

Previously, non-trust-leader nodes couldn't connect back to the trust leader due to the presence of the wrong `ca-client.crt` on their disk; the main CA cert/key was being written in four places. This change fixes that bug, and also creates a new `client.node.crt` certificate to prevent other subsequent errors from being thrown. Fixes cockroachdb#61624. Release note: None.

…nnect` The end-to-end test for the new `connect` command was incomplete, because of issue cockroachdb#61624 that was blocking the functionality. Now that cockroachdb#63589 is in, we can add the missing test. Release note: None

63589: server, security: Fix one-way connectivity with connect cmd r=knz a=itsbilal Informs #60632. Previously, non-trust-leader nodes couldn't connect back to the trust leader due to the presence of the wrong `ca-client.crt` on their disk; the main CA cert/key was being written in four places. This change fixes that bug, and also creates a new `client.node.crt` certificate to prevent other subsequent errors from being thrown. Fixes #61624. Release note: None. 63672: kvserver: fix write below closedts bug r=andreimatei a=andreimatei This patch fixes a bug in our closed timestamp management. This bug was making it possible for a command to close a timestamp even though other requests writing at lower timestamps are currently evaluating. The problem was that we were assuming that, if a replica is proposing a new lease, there can't be any requests in flight and every future write evaluated on the range will wait for the new lease and the evaluate above the lease start time. Based on that reasoning, the proposal buffer was recording the lease start time as its assignedClosedTimestamp. This was matching what it does for every write, where assignedClosedTimestamp corresponds to the the closed timestamp carried by the command. It turns out that the replica's reasoning was wrong. It is, in fact, possible for writes to be evaluating on the range when the lease acquisition is proposed. And these evaluations might be done at timestamps below the would-be lease's start time. This happens when the replica has already received a lease through a lease transfer. The transfer must have applied after the previous lease expired and the replica decided to start acquiring a new one. This fixes one of the assertion failures seen in #62655. Release note (bug fix): A bug leading to crashes with the message "writing below closed ts" has been fixed. 63756: backupccl: reset restored jobs during cluster restore r=dt a=pbardea Previously, jobs were restored without modification during cluster restore. Due to a recently discovered bug where backup may miss non-transactional writes written to offline spans by these jobs, their progress may no longer be accurate on the restored cluster. IMPORT and RESTORE jobs perform non-transactional writes that may be missed. When a cluster RESTORE brings back these OFFLINE tables, it will also bring back its associated job. To ensure the underlying data in these tables is correct, the jobs are now set in a reverting state so that they can clean up after themselves. In-progress schema change jobs that are affected, will fail upon validation. Release note (bug fix): Fix a bug where restored jobs may have assumed to have made progress that was not captured in the backup. The restored jobs are now either canceled cluster restore. 63837: build: update the go version requirement for `make` r=otan a=knz Fixes #63837. The builder image already requires go 1.15.10. This patch modifies the check for a non-builder `make` command to require at least the same version. Release note: None Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Paul Bardea <pbardea@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

…nnect` The end-to-end test for the new `connect` command was incomplete, because of issue cockroachdb#61624 that was blocking the functionality. Now that cockroachdb#63589 is in, we can add the missing test. Release note: None

63846: cli/interactive_tests: complete the end-to-end test for `cockroach connect` r=itsbilal a=knz The end-to-end test for the new `connect` command was incomplete, because of issue #61624 that was blocking the functionality. Now that #63589 is in, we can add the missing test. Release note: None 63921: schemaexpr: fix data race in ProcessColumnSet r=adityamaru a=postamar This commit fixes a data race introduced by my recent changes tracked under #63755, involving the generalized use of catalog.Column instead of descpb.ColumnDescriptor. Fixes #63907 Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Marius Posta <marius@cockroachlabs.com>

A couple changes in this commit: - Capitalize CA consistently in protobuf structs and method names - Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere - Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624. Release note: None.

This change adds a new CLI command, `connect join`, that lets a new node retrieve CA certificates off of an existing secure cluster and bootstrap its own TLS certificates for joining that cluster. This is achieved through the consumption of a join token stored in the join_tokens table. Join tokens can be created using the `create_join_tokens()` sql builtin function, added in a previous change. The previous `connect` command has now been renamed to `connect init`. A new set of GRPC endpoints have been created to handle the server side of this change; to support retrieval of CAs as well as entire initialization cert bundles. The cert bundles with CAs and private keys aren't sent over the wire until a node join token has been verified and consumed. The receiving node (the one running `connect join`) then stores them in its SSLCertsDir and bootstraps its own certificates off of it. This feature is hidden behind a feature flag. A couple other changes in this commit to clean up related code in the `security` package: - Capitalize CA consistently in protobuf structs and method names - Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere - Pass around raw byte certificates in auto_tls_init.go instead of doing excess PEM encodings and decodings. - Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624. Release note (cli change): Rename `connect` to `connect init`, and add `connect join` command to retrieve certificates from an existing secure cluster and setup a new node to connect with it. Co-authored-by: Aaron Blum <aaron@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>

63168: ALTER SCHEMA, CREATE SCHEMA, DROP SCHEMA diagrams r=ericharmeling a=ericharmeling Release justification: non-production code changes Release note: None 63492: server, security: Token-based add/join TLS functionality r=knz a=itsbilal Informs #60632. This change adds a new CLI command, `connect join`, that lets a new node retrieve CA certificates off of an existing secure cluster and bootstrap its own TLS certificates for joining that cluster. This is achieved through the consumption of a join token stored in the join_tokens table. Join tokens can be created using the `create_join_tokens()` sql builtin function, added in a previous change. The previous `connect` command has now been renamed to `connect init`. A new set of GRPC endpoints have been created to handle the server side of this change; to support retrieval of CAs as well as entire initialization cert bundles. The cert bundles with CAs and private keys aren't sent over the wire until a node join token has been verified and consumed. The receiving node (the one running `connect join`) then stores them in its SSLCertsDir and bootstraps its own certificates off of it. This feature is hidden behind a feature flag. A couple other changes in this commit to clean up related code in the `security` package: - Capitalize CA consistently in protobuf structs and method names - Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere - Pass around raw byte certificates in auto_tls_init.go instead of doing excess PEM encodings and decodings. - Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other. Fixes #61624. Release note (cli change): Rename `connect` to `connect init`, and add `connect join` command to retrieve certificates from an existing secure cluster and setup a new node to connect with it. Co-authored-by: Aaron Blum <aaron@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com> Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>

knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cli A-security labels Mar 8, 2021

knz added A-cli-admin CLI commands that pertain to controlling and configuring nodes and removed A-cli labels Mar 20, 2021

itsbilal self-assigned this Apr 12, 2021

itsbilal mentioned this issue Apr 12, 2021

server, security: Token-based add/join TLS functionality #63492

Merged

itsbilal mentioned this issue Apr 13, 2021

server, security: Fix one-way connectivity with connect cmd #63589

Merged

knz mentioned this issue Apr 19, 2021

cli/interactive_tests: complete the end-to-end test for cockroach connect #63846

Merged

craig bot closed this as completed in 60dba10 Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cli: TLS certs generated via `connect` cause one-way connectivity problem #61624

cli: TLS certs generated via `connect` cause one-way connectivity problem #61624

knz commented Mar 8, 2021

knz commented Mar 8, 2021 •

edited

Loading

itsbilal commented Apr 8, 2021

cli: TLS certs generated via connect cause one-way connectivity problem #61624

cli: TLS certs generated via connect cause one-way connectivity problem #61624

Comments

knz commented Mar 8, 2021

knz commented Mar 8, 2021 • edited Loading

itsbilal commented Apr 8, 2021

cli: TLS certs generated via `connect` cause one-way connectivity problem #61624

cli: TLS certs generated via `connect` cause one-way connectivity problem #61624

knz commented Mar 8, 2021 •

edited

Loading