Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: TLS certs generated via connect cause one-way connectivity problem #61624

Closed
knz opened this issue Mar 8, 2021 · 2 comments · Fixed by #63492 or #63589
Closed

cli: TLS certs generated via connect cause one-way connectivity problem #61624

knz opened this issue Mar 8, 2021 · 2 comments · Fixed by #63492 or #63589
Assignees
Labels
A-cli-admin CLI commands that pertain to controlling and configuring nodes A-security C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@knz
Copy link
Contributor

knz commented Mar 8, 2021

Here is what I used:

  1. on machine 192.168.2.10 I ran the following command:

    ./cockroach connect --num-expected-initial-nodes 2 --init-token abc --listen-addr=192.168.2.10 --join=192.168.2.19:26258

  2. on machine 192.168.2.19 I ran the following:

    ./cockroach connect --num-expected-initial-nodes 2 --init-token abc --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081

This made the connect command complete successfully.

Note: beware of mentioning the port number in --join (because of issue #61620) and explicit IP addresses in --listen-addr (because of issues #61619 and #61616)

  1. Then as recommended by the connect command I ran the following, which worked:

    ./cockroach cert create-client root --ca-key=~/.cockroach-certs/ca-client.key

Then I started my CockroachDB nodes:

  1. on machine 192.168.2.10:

    ./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10

  2. on machine 192.168.2.19:

    ./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081

  3. Here I start observing in logs something that is unexpected / undesirable (first symptom of the problem):

    • on 192.168.2.10 logs are fine:

      I210308 15:50:28.243256 206 server/init.go:420 ⋮ [n?] 28 ‹192.168.2.19:26258› is itself waiting for init, will retry

      This indicates that this server is able to establish an outgoing RPC conn to the other one.

    • on 192.168.2.19, we see the problem:

      W210308 16:05:46.896090 150 server/init.go:422 ⋮ [n?] 41 outgoing join rpc to ‹192.168.2.10:26257› unsuccessful: ‹rpc error: code = Unauthenticated desc = TLSInfo is not a vailable in request context›

      This indicates that this server is unable to establish its outgoing RPC conn to the other one.

  4. at this point I was suspecting that maybe the init RPC is special and uses a different TLS configuration that an already-initialized server. So I ran the following, which worked without errors:

    • on 192.168.2.10: ./cockroach start --join=192.168.2.19:26258 --listen-addr=192.168.2.10 --insecure
    • on 192.168.2.19: ./cockroach start --join=192.168.2.10:26257 --listen-addr=192.168.2.19:26258 --http-addr=:8081 --insecure
    • anywhere: ./cockroach init --host=... --port=... --insecure

    This initializes the cluster and assigns node ID without connectivity errors.

  5. Then I re-start the already-initialized servers, using the same commands as previously. Then ISee:

    • on 192.168.2.10 I see the following errors:
E210308 16:09:11.689777 446 kv/kvserver/consistency_queue.go:191 ⋮ [n1,consistencyChecker,s1,r22/1:‹/Table/2{6-7}›] 123  computing own checksum: could not dial node ID 1: u
nable to dial n1: ‹breaker open›
E210308 16:09:11.689856 446 kv/kvserver/queue.go:1093 ⋮ [n1,consistencyChecker,s1,r22/1:‹/Table/2{6-7}›] 124  computing own checksum: could not dial node ID 1: unable to di
al n1: ‹breaker open›
E210308 16:09:13.688958 1331 kv/kvserver/consistency_queue.go:191 ⋮ [n1,consistencyChecker,s1,r7/1:‹/Table/1{1-2}›] 125  computing own checksum: could not dial node ID 1: f
ailed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context
›
  • on 192.168.2.19 I see the following errors:
W210308 16:08:49.464792 132 vendor/google.golang.org/grpc/internal/channelz/logging.go:73 ⋮ [-] 55  ‹grpc: addrConn.createTransport failed to connect to {192.168.2.10:26257
  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...›
I210308 16:08:51.678239 257 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 56  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› tripped: failed to
 connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.678323 257 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 57  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› event: ‹BreakerTri
pped›
I210308 16:08:51.678351 257 2@rpc/nodedialer/nodedialer.go:160 ⋮ [ct-client] 58  unable to connect to n1: failed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connecti
on heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.766654 76 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 59  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› tripped: failed to
connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›
I210308 16:08:51.766737 76 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 60  circuitbreaker: ‹rpc 192.168.2.19:26258 [n1]› event: ‹BreakerTrip
ped›
I210308 16:08:51.766759 76 2@rpc/nodedialer/nodedialer.go:160 ⋮ [n2,ts-poll,range-lookup=‹/Meta2/System/tsd/cr.node.build.timestamp/2/10s/2021-03-08T16:00:00Z›] 61  unable
to connect to n1: failed to connect to n1 at ‹192.168.2.10:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available
 in request context›

Related to #60632

cc @aaron-crl @itsbilal

@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cli A-security labels Mar 8, 2021
@knz
Copy link
Contributor Author

knz commented Mar 8, 2021

@aaron-crl this is the specific issue that was blocking me last Monday. I can still reproduce this at will.

The description at top pertains a scenario using 2 different machines.

I can also reproduce this by using a single machine but different port numbers, for example:

mkdir -p n1 n2

(cd n2 && ../cockroach connect --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26258 --listen-addr=192.168.2.19:26259  --http-addr=:8082 --certs-dir=certs)
# in another terminal
(cd n1 && ../cockroach connect --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26259 --listen-addr=192.168.2.19:26258  --http-addr=:8081 --certs-dir=certs

# the connect command succeeds; now run the servers:

(cd n2 && ../cockroach start --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26259 --listen-addr=192.168.2.19:26258  --http-addr=:8081 --certs-dir=certs)
# in another terminal
(cd n1 && ../cockroach start --num-expected-initial-nodes 2 --init-token 2 --join=192.168.2.19:26258 --listen-addr=192.168.2.19:26259  --http-addr=:8082 --certs-dir=certs)

Observe the log files: one node can connect to the other, but the other one complains when it tries to connect back:

outgoing join rpc to ‹192.168.2.19:26259› unsuccessful: ‹rpc error: code = Unauthenticated desc = TLSInfo is not available in request context›

@knz knz added A-cli-admin CLI commands that pertain to controlling and configuring nodes and removed A-cli labels Mar 20, 2021
@itsbilal
Copy link
Member

itsbilal commented Apr 8, 2021

I can also repro this, and also with the node-join stuff. Might take a look as Aaron is away at the moment.

@itsbilal itsbilal self-assigned this Apr 12, 2021
itsbilal added a commit to itsbilal/cockroach that referenced this issue Apr 13, 2021
Previously, non-trust-leader nodes couldn't connect back
to the trust leader due to the presence of the wrong
`ca-client.crt` on their disk; the main CA cert/key was
being written in four places.

This change fixes that bug,
and also creates a new `client.node.crt` certificate to prevent
other subsequent errors from being thrown.

Fixes cockroachdb#61624.

Release note: None.
knz added a commit to knz/cockroach that referenced this issue Apr 19, 2021
…nnect`

The end-to-end test for the new `connect` command was incomplete,
because of issue cockroachdb#61624 that was blocking the functionality.

Now that cockroachdb#63589 is in, we can add the missing test.

Release note: None
craig bot pushed a commit that referenced this issue Apr 19, 2021
63589: server, security: Fix one-way connectivity with connect cmd r=knz a=itsbilal

Informs #60632.

Previously, non-trust-leader nodes couldn't connect back
to the trust leader due to the presence of the wrong
`ca-client.crt` on their disk; the main CA cert/key was
being written in four places.

This change fixes that bug,
and also creates a new `client.node.crt` certificate to prevent
other subsequent errors from being thrown.

Fixes #61624.

Release note: None.

63672: kvserver: fix write below closedts bug  r=andreimatei a=andreimatei

This patch fixes a bug in our closed timestamp management. This bug was
making it possible for a command to close a timestamp even though other
requests writing at lower timestamps are currently evaluating. The
problem was that we were assuming that, if a replica is proposing a new
lease, there can't be any requests in flight and every future write
evaluated on the range will wait for the new lease and the evaluate
above the lease start time. Based on that reasoning, the proposal buffer
was recording the lease start time as its assignedClosedTimestamp. This
was matching what it does for every write, where assignedClosedTimestamp
corresponds to the the closed timestamp carried by the command.

It turns out that the replica's reasoning was wrong. It is, in fact,
possible for writes to be evaluating on the range when the lease
acquisition is proposed. And these evaluations might be done at
timestamps below the would-be lease's start time. This happens when the
replica has already received a lease through a lease transfer. The
transfer must have applied after the previous lease expired and the
replica decided to start acquiring a new one.

This fixes one of the assertion failures seen in #62655.

Release note (bug fix): A bug leading to crashes with the message
"writing below closed ts" has been fixed.

63756: backupccl: reset restored jobs during cluster restore r=dt a=pbardea

Previously, jobs were restored without modification during cluster
restore. Due to a recently discovered bug where backup may miss
non-transactional writes written to offline spans by these jobs, their
progress may no longer be accurate on the restored cluster.

IMPORT and RESTORE jobs perform non-transactional writes that may be
missed. When a cluster RESTORE brings back these OFFLINE tables, it will
also bring back its associated job. To ensure the underlying data in
these tables is correct, the jobs are now set in a reverting state so
that they can clean up after themselves.

In-progress schema change jobs that are affected, will fail upon
validation.

Release note (bug fix): Fix a bug where restored jobs may have assumed
to have made progress that was not captured in the backup. The restored
jobs are now either canceled cluster restore.

63837: build: update the go version requirement for `make` r=otan a=knz

Fixes #63837.

The builder image already requires go 1.15.10. This patch
modifies the check for a non-builder `make` command to require
at least the same version.

Release note: None

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig craig bot closed this as completed in 60dba10 Apr 19, 2021
knz added a commit to knz/cockroach that referenced this issue Apr 19, 2021
…nnect`

The end-to-end test for the new `connect` command was incomplete,
because of issue cockroachdb#61624 that was blocking the functionality.

Now that cockroachdb#63589 is in, we can add the missing test.

Release note: None
craig bot pushed a commit that referenced this issue Apr 20, 2021
63846: cli/interactive_tests: complete the end-to-end test for `cockroach connect` r=itsbilal a=knz

The end-to-end test for the new `connect` command was incomplete,
because of issue #61624 that was blocking the functionality.

Now that #63589 is in, we can add the missing test.

Release note: None

63921: schemaexpr: fix data race in ProcessColumnSet r=adityamaru a=postamar

This commit fixes a data race introduced by my recent changes tracked
under #63755, involving the generalized use of catalog.Column instead of
descpb.ColumnDescriptor.

Fixes #63907

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
itsbilal added a commit to itsbilal/cockroach that referenced this issue May 5, 2021
A couple changes in this commit:
 - Capitalize CA consistently in protobuf structs and method names
 - Write all CAs/certs from the right places in InitializeFromConfig
   instead of writing the InterNode one everywhere
 - Create a `client.node.crt` signed by `ca-client.crt`, otherwise
   nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624.

Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this issue May 6, 2021
A couple changes in this commit:
 - Capitalize CA consistently in protobuf structs and method names
 - Write all CAs/certs from the right places in InitializeFromConfig
   instead of writing the InterNode one everywhere
 - Create a `client.node.crt` signed by `ca-client.crt`, otherwise
   nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624.

Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this issue May 6, 2021
This change adds a new CLI command, `connect join`, that lets a new
node retrieve CA certificates off of an existing secure cluster
and bootstrap its own TLS certificates for joining that cluster.
This is achieved through the consumption of a join token
stored in the join_tokens table. Join tokens can be created
using the `create_join_tokens()` sql builtin function, added
in a previous change.

The previous `connect` command has now been renamed to
`connect init`.

A new set of GRPC endpoints have been created to handle the
server side of this change; to support retrieval of CAs as well
as entire initialization cert bundles. The cert bundles with
CAs and private keys aren't sent over the wire until a
node join token has been verified and consumed. The receiving
node (the one running `connect join`) then stores them in its
SSLCertsDir and bootstraps its own certificates off of it.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code
in the `security` package:
 - Capitalize CA consistently in protobuf structs and method names
 - Write all CAs/certs from the right places in InitializeFromConfig
   instead of writing the InterNode one everywhere
 - Pass around raw byte certificates in auto_tls_init.go instead of
   doing excess PEM encodings and decodings.
 - Create a `client.node.crt` signed by `ca-client.crt`, otherwise
   nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624.

Release note (cli change): Rename `connect` to `connect init`,
and add `connect join` command to retrieve certificates from
an existing secure cluster and setup a new node to connect with
it.

Co-authored-by: Aaron Blum <aaron@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
craig bot pushed a commit that referenced this issue May 6, 2021
63168: ALTER SCHEMA, CREATE SCHEMA, DROP SCHEMA diagrams r=ericharmeling a=ericharmeling

Release justification: non-production code changes

Release note: None

63492: server, security: Token-based add/join TLS functionality  r=knz a=itsbilal

Informs  #60632.

This change adds a new CLI command, `connect join`, that lets a new
node retrieve CA certificates off of an existing secure cluster
and bootstrap its own TLS certificates for joining that cluster.
This is achieved through the consumption of a join token
stored in the join_tokens table. Join tokens can be created
using the `create_join_tokens()` sql builtin function, added
in a previous change.

The previous `connect` command has now been renamed to
`connect init`.

A new set of GRPC endpoints have been created to handle the
server side of this change; to support retrieval of CAs as well
as entire initialization cert bundles. The cert bundles with
CAs and private keys aren't sent over the wire until a
node join token has been verified and consumed. The receiving
node (the one running `connect join`) then stores them in its
SSLCertsDir and bootstraps its own certificates off of it.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code
in the `security` package:
 - Capitalize CA consistently in protobuf structs and method names
 - Write all CAs/certs from the right places in InitializeFromConfig
   instead of writing the InterNode one everywhere
 - Pass around raw byte certificates in auto_tls_init.go instead of
   doing excess PEM encodings and decodings.
 - Create a `client.node.crt` signed by `ca-client.crt`, otherwise
   nodes wouldn't be able to connect to each other. Fixes #61624.

Release note (cli change): Rename `connect` to `connect init`,
and add `connect join` command to retrieve certificates from
an existing secure cluster and setup a new node to connect with
it.

Co-authored-by: Aaron Blum <aaron@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>

Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cli-admin CLI commands that pertain to controlling and configuring nodes A-security C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
2 participants