Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tls cert race/core 9178/v25.1.x #175

Conversation

michael-redpanda
Copy link

Fixes: CORE-9178

This addresses a possible race condition that some customers have witnessed where Seastar was not returning the correct certificate and this caused authorization failures on the broker. The fix is to use the correct OpenSSL API (SSL_get1_peer_certificate) rather than the cert held by the callback.

Note: We will backport this to v24.3 and v24.2 to address this bug in all OpenSSL implementations

@michael-redpanda michael-redpanda requested a review from a team February 21, 2025 20:37
@michael-redpanda michael-redpanda self-assigned this Feb 21, 2025
@michael-redpanda michael-redpanda requested review from pgellert and removed request for a team February 21, 2025 20:37
@michael-redpanda michael-redpanda force-pushed the tls-cert-race/core-9178/v25.1.x branch from 913465f to 96c1345 Compare February 21, 2025 20:44
@michael-redpanda
Copy link
Author

CI failures in two categories:

  • CPU profiler unit test flakiness (I believe Fixes for SG in cpu profile #174 is dealing with that)
  • Failure in building rpc_test.cc - code doesn't touch this so it's probably something that's already there

@StephanDollberg
Copy link
Member

Failure in building rpc_test.cc - code doesn't touch this so it's probably something that's already there

Should be fixed by #176

@michael-redpanda
Copy link
Author

By the end of our three PRs we may get one green run ;)

@michael-redpanda
Copy link
Author

CI Failures:

  • Modules fails to build
  • cpu_profiler still flaky

Comment on lines +1108 to +1116
auto error_codes = get_all_ossl_errors();
if (contains_ossl_error(error_codes, ERR_LIB_SSL, SSL_R_UNEXPECTED_EOF_WHILE_READING)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this pattern be used in other cases where ERR_peek_error() != 0?

In all the other cases, ERR_clear_error() is also used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's probably fair

Signed-off-by: Michael Boquard <michael@redpanda.com>
Created a new make_ossl_error function that takes in a pre-generated
list of OpenSSL error codes.  This is helpful if the error codes were
already pulled from the stack to be analyzed.

Created a helper function that drains the openssl error stack and
returns a list of errors.

Created a contains_ossl_error function that is used to analyze a list of
errors and determine if any match the library and error code.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Sometimes the verification errors are not the first error in the OpenSSL
error stack.  This change will drain the error stack into a vector of
errors and then check if any of the errors match the certificate
verification error codes.  This will improve error messaging in
situations where the peer certificate fails to verify.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Update remaining ERR_peek_error checks to use `get_all_ossl_errors`.

Signed-off-by: Michael Boquard <michael@redpanda.com>
This change should address a possible race condition when a using
application under heavy load attempts to view the DN of the peer's
(client's) certificate.

Once the handshake is successful, the function
SSL_get1_peer_certificate() should be used to obtain a copy of the
peer's certificate.  The verify_callback() mechanism is still used to
hold the last seen certificate that is being verified.  This is
necessary for error handling in order to inform a user application of
which certificate failed verification.

Customers have witnessed authorization failures because sometimes the
wrong certificate was returned purporting to be the client's
certificate.  This was traced to a possible race condition between
fetching the certificate and the callback function holding the wrong
certificate at the time.

Signed-off-by: Michael Boquard <michael@redpanda.com>
Without this change, our CI is testing against the GnuTLS
implementation.

Updated the install-dependencies script to include OpenSSL development
headers.

Signed-off-by: Michael Boquard <michael@redpanda.com>
More recent versions of OpenSSL are stricter about which extensions are
required.  Updated CA generation to use the `v3_ca` extensions found in
cert.cfg.in.

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda force-pushed the tls-cert-race/core-9178/v25.1.x branch from 96b4655 to 7d7c0c8 Compare February 24, 2025 14:10
@michael-redpanda
Copy link
Author

Force push 7d7c0c8:

  • Changed name of bool_class type
  • Fixed wording in tls_test.cc
  • Using new error check mechanism in place of ERR_peek_error

Copy link

@IoannisRP IoannisRP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michael-redpanda michael-redpanda merged commit 534c1a8 into redpanda-data:v25.1.x Feb 24, 2025
13 of 14 checks passed
michael-redpanda added a commit to michael-redpanda/redpanda that referenced this pull request Feb 24, 2025
Updated reference to pull in fix introduced by
redpanda-data/seastar#175

Signed-off-by: Michael Boquard <michael@redpanda.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants