Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to OpenSSL error handling #112

Merged
merged 10 commits into from
Aug 15, 2023

Conversation

jhawthorn
Copy link
Member

@jhawthorn jhawthorn commented Aug 9, 2023

This fixes a number of issues with our OpenSSL integration.

First this ensures that the socket is shutdown (or closed post-#111) after error from OpenSSL. Previously we only closed the connection on TRILOGY_OPENSSL_ERR but not on a TRILOGY_CLOSED_CONNECTION or TRILOGY_SYSERR which could be returned/translated from an error in OpenSSL. As SSL holds its own buffers we must not attempt anything (including SSL_shutdown) after a failure from SSL_read or SSL_write.

To address that this also makes ALL exceptions (other than TRILOGY_ERR and TRILOGY_MAX_PACKET_EXCEEDED) shutdown the connection. These exceptions don't generally indicate something recoverable (if we get an error from a syscall other than EAGAIN, which we handle specially, we're likely to keep seeing it) and even if it was recoverable these exceptions happen in while we're in an inconsistent state (writing, or while expecting more data from the server), attempts to send a new packet from the start will be invalid.

Next this moves some translation based on errno from client.c to make it specific to raw sockets. I believe an OpenSSL socket will return SSL_ERROR_WANT_WRITE or SSL_ERROR_WANT_READ in any case the underlying socket returns EAGAIN, but that's not an assumption we should be making on the return value. OpenSSL has its own buffers and we can only retry when it tells us to. This also moves the translation of EPIPE to the raw socket, but also special cases it in trilogy_syserr_fail_str for backwards compatibility. client.c should be threating the socket like a black box not be assuming that it is backed by system calls.

Next this fixes checks on the return value from SSL_read and SSL_write. A return value of 0 is an error. Previously we incorrectly assumed that return value held the same meaning as from read or write, which is incorrect. Usually this happened on an unexpected EOF from the socket (which we'd happen to treat correctly, assuming the same meaning as read), but looking through the OpenSSL source 0 can mean a number of other errors. Instead this now adds a check for the case described in the SSL_get_error's man page "BUG" section.

The SSL_ERROR_SYSCALL with errno value of 0 indicates unexpected EOF from the peer. This will be properly reported as SSL_ERROR_SSL with reason code SSL_R_UNEXPECTED_EOF_WHILE_READING in the OpenSSL 3.0 release because it is truly a TLS protocol error to terminate the connection without a SSL_shutdown().

The issue is kept unfixed in OpenSSL 1.1.1 releases because many applications which choose to ignore this protocol error depend on the existing way of reporting the error.

Additionally this clears any SSL errors inside rb_trilogy_close. We don't check whether trilogy_close_send/trilogy_close_recv return an error, which is (probably) fine: we're closing the socket, but we must clear any OpenSSL errors which are left on the stack from that read/write operation.

Finally, this adds a call to ERR_clear_error() before every OpenSSL operation. This is a workaround due to other libraries which incorrectly leave unhandled errors in the global (thread-local) OpenSSL error queue. This is a workaround other libraries also seem to have adopted (ones I know of: libpq and libcurl, the later of which ironically is responsible for causing this problem).

I would be happy to not look at OpenSSL for a while after this 😅

cc @adrianna-chang-shopify @casperisfine @composerinteralia @matthewd

Special thanks to @pudiva for the deep investigation into the source of the unhandled errors on the SSL queue ❤️. Which revealed that some of the errors we were seeing were due to mistakes in Trilogy (insufficiently shutting down our connections after errors) and others were from other libraries (specifically libcurl, requiring clearing the queue before attempting reads/writes).

We must shutdown the socket after any SSL exception, including those
we've translated into a SYSERR.

For other exceptions we should shutdown the socket as well unless the
errors are known to occur at a point it is safe (no packets are left in
flight).
rb_trilogy_close always succeeds, and we don't check or raise errors.
However trilogy_close_send, flush_writes, or trilogy_close_recv may push
an OpenSSL error onto the error queue, so we must clear the queue.
OpenSSL can return 0 as an error code. Previously on a read we would
turn that into a CONNECTION_CLOSED, which may be accurate, but we should
check for the specific conditions OpenSSL's documentation describes.

As per https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html

> The SSL_ERROR_SYSCALL with errno value of 0 indicates unexpected EOF
> from the peer. This will be properly reported as SSL_ERROR_SSL with
> reason code SSL_R_UNEXPECTED_EOF_WHILE_READING in the OpenSSL 3.0
> release because it is truly a TLS protocol error to terminate the
> connection without a SSL_shutdown().
>
> The issue is kept unfixed in OpenSSL 1.1.1 releases because many
> applications which choose to ignore this protocol error depend on the
> existing way of reporting the error.

I think trilogy would be correct in turning this into an OpenSSL error
(as will happen on openssl 3.0) but for improved error reporting and
restoring closer to previous behaviour we can treat this as a connection
closed.

Additionally, this now only translates an exception to SSL_ERROR_SYSCALL
if there is nothing in the error queue. I believe that's always the case
but I couldn't find it stated explicitly in the documentation so let's
test for it rather than assuming.
Previously we would inspect errno inside of client.c to try to return
TRILOGY_AGAIN in cases it seemed appropriate. However on an SSL socket
(or any other hypothetical generic socket) we really should not retry
unless OpenSSL tells us to.

This moves the logic into _cb_raw_*, so that it only applies to raw
sockets.
If we see an EPIPE via an OpenSSL error, we should translate it to a
TRILOGY_CLOSED_CONNECTION to match previous behaviour.

I think QueryError: TRILOGY_CLOSED_CONNECTION makes very little sense as
an error, but that's something we should tackle separately.
Libraries using OpenSSL should always clear any errors they create from
the global error queue, however it's a very easy mistake to make, and
hard to diagnose the true source.

This commit clears any errors in the thread local queue before
attempting SSL_read/SSL_write operations, as any errors left in the
queue will cause SSL_get_error to misinterpret WANT_READ/WANT_WRITE as
whatever error was left in the queue.

Other libraries seem to do this. Libcurl as one example applies this
same workaround (ironically libcurl is also the library we are seeing
misbehaving and not clearing their errors properly). libpq is another
example.
Comment on lines +98 to +99
// Backwards compatibility: This error class makes no sense, but matches legacy behavior
rb_raise(Trilogy_QueryError, "%" PRIsVALUE ": TRILOGY_CLOSED_CONNECTION", msg);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that I didn't want to change this exception in this PR, this is the exact exception that would have previously been raised: QueryError is our fallback and is what was raised by a TRILOGY_CLOSED_CONNECTION, which is what EPIPE was previously translated into.

Let's discuss separately how we can improve exceptions as I'm sure there is discussion to be had.

With improved error handling, under OpenSSL > 3 we now see this error
instead of TRILOGY_CLOSED_CONNECTION.
@jhawthorn
Copy link
Member Author

Pushed one extra commit to allow SSL Error: unexpected eof while reading inside assert_raises_connection_error, which should only occur on OpenSSL 3+ .

I think that makes sense to allow as the most specific error we have, but could see an argument that we should detect SSL_R_UNEXPECTED_EOF_WHILE_READING and translate it to the same error we'd see from TRILOGY_CLOSED_CONNECTION. What do y'all think?

Copy link
Collaborator

@adrianna-chang-shopify adrianna-chang-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intricacies of OpenSSL sockets, oof 😅 These changes look good to me 👍

Re

I think that makes sense to allow as the most specific error we have, but could see an argument that we should detect SSL_R_UNEXPECTED_EOF_WHILE_READING and translate it to the same error we'd see from TRILOGY_CLOSED_CONNECTION. What do y'all think?

I'm in favour of raising this as a Trilogy::SSLError over a more generic closed connection error, given that OpenSSL is opting to treat it as an SSL protocol error (ie. from the docs, it's described as a protocol error, vs just attempting to communicate over a socket that's been shutdown, resulting in an EOF error). Are there concerns about the error class changing from OpenSSL 1.1.1 to OpenSSL 3?

contrib/ruby/test/client_test.rb Outdated Show resolved Hide resolved
Co-authored-by: Adrianna Chang <adrianna.chang@shopify.com>
@jhawthorn
Copy link
Member Author

Are there concerns about the error class changing from OpenSSL 1.1.1 to OpenSSL 3?

I'm not especially concerned, but we should probably update Rails to consider it a retryable error.

@jhawthorn jhawthorn merged commit 24e7369 into trilogy-libraries:main Aug 15, 2023
@jhawthorn jhawthorn deleted the openssl_fixes branch August 15, 2023 22:52
@bensheldon bensheldon mentioned this pull request Jan 24, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants