-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to OpenSSL error handling #112
Conversation
We must shutdown the socket after any SSL exception, including those we've translated into a SYSERR. For other exceptions we should shutdown the socket as well unless the errors are known to occur at a point it is safe (no packets are left in flight).
rb_trilogy_close always succeeds, and we don't check or raise errors. However trilogy_close_send, flush_writes, or trilogy_close_recv may push an OpenSSL error onto the error queue, so we must clear the queue.
OpenSSL can return 0 as an error code. Previously on a read we would turn that into a CONNECTION_CLOSED, which may be accurate, but we should check for the specific conditions OpenSSL's documentation describes. As per https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html > The SSL_ERROR_SYSCALL with errno value of 0 indicates unexpected EOF > from the peer. This will be properly reported as SSL_ERROR_SSL with > reason code SSL_R_UNEXPECTED_EOF_WHILE_READING in the OpenSSL 3.0 > release because it is truly a TLS protocol error to terminate the > connection without a SSL_shutdown(). > > The issue is kept unfixed in OpenSSL 1.1.1 releases because many > applications which choose to ignore this protocol error depend on the > existing way of reporting the error. I think trilogy would be correct in turning this into an OpenSSL error (as will happen on openssl 3.0) but for improved error reporting and restoring closer to previous behaviour we can treat this as a connection closed. Additionally, this now only translates an exception to SSL_ERROR_SYSCALL if there is nothing in the error queue. I believe that's always the case but I couldn't find it stated explicitly in the documentation so let's test for it rather than assuming.
Previously we would inspect errno inside of client.c to try to return TRILOGY_AGAIN in cases it seemed appropriate. However on an SSL socket (or any other hypothetical generic socket) we really should not retry unless OpenSSL tells us to. This moves the logic into _cb_raw_*, so that it only applies to raw sockets.
If we see an EPIPE via an OpenSSL error, we should translate it to a TRILOGY_CLOSED_CONNECTION to match previous behaviour. I think QueryError: TRILOGY_CLOSED_CONNECTION makes very little sense as an error, but that's something we should tackle separately.
Libraries using OpenSSL should always clear any errors they create from the global error queue, however it's a very easy mistake to make, and hard to diagnose the true source. This commit clears any errors in the thread local queue before attempting SSL_read/SSL_write operations, as any errors left in the queue will cause SSL_get_error to misinterpret WANT_READ/WANT_WRITE as whatever error was left in the queue. Other libraries seem to do this. Libcurl as one example applies this same workaround (ironically libcurl is also the library we are seeing misbehaving and not clearing their errors properly). libpq is another example.
// Backwards compatibility: This error class makes no sense, but matches legacy behavior | ||
rb_raise(Trilogy_QueryError, "%" PRIsVALUE ": TRILOGY_CLOSED_CONNECTION", msg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting that I didn't want to change this exception in this PR, this is the exact exception that would have previously been raised: QueryError is our fallback and is what was raised by a TRILOGY_CLOSED_CONNECTION
, which is what EPIPE was previously translated into.
Let's discuss separately how we can improve exceptions as I'm sure there is discussion to be had.
With improved error handling, under OpenSSL > 3 we now see this error instead of TRILOGY_CLOSED_CONNECTION.
Pushed one extra commit to allow I think that makes sense to allow as the most specific error we have, but could see an argument that we should detect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intricacies of OpenSSL sockets, oof 😅 These changes look good to me 👍
Re
I think that makes sense to allow as the most specific error we have, but could see an argument that we should detect SSL_R_UNEXPECTED_EOF_WHILE_READING and translate it to the same error we'd see from TRILOGY_CLOSED_CONNECTION. What do y'all think?
I'm in favour of raising this as a Trilogy::SSLError
over a more generic closed connection error, given that OpenSSL is opting to treat it as an SSL protocol error (ie. from the docs, it's described as a protocol error, vs just attempting to communicate over a socket that's been shutdown, resulting in an EOF error). Are there concerns about the error class changing from OpenSSL 1.1.1 to OpenSSL 3?
Co-authored-by: Adrianna Chang <adrianna.chang@shopify.com>
I'm not especially concerned, but we should probably update Rails to consider it a retryable error. |
This fixes a number of issues with our OpenSSL integration.
First this ensures that the socket is shutdown (or closed post-#111) after error from OpenSSL. Previously we only closed the connection on
TRILOGY_OPENSSL_ERR
but not on aTRILOGY_CLOSED_CONNECTION
orTRILOGY_SYSERR
which could be returned/translated from an error in OpenSSL. As SSL holds its own buffers we must not attempt anything (includingSSL_shutdown
) after a failure fromSSL_read
orSSL_write
.To address that this also makes ALL exceptions (other than
TRILOGY_ERR
andTRILOGY_MAX_PACKET_EXCEEDED
) shutdown the connection. These exceptions don't generally indicate something recoverable (if we get an error from a syscall other than EAGAIN, which we handle specially, we're likely to keep seeing it) and even if it was recoverable these exceptions happen in while we're in an inconsistent state (writing, or while expecting more data from the server), attempts to send a new packet from the start will be invalid.Next this moves some translation based on
errno
fromclient.c
to make it specific to raw sockets. I believe an OpenSSL socket will returnSSL_ERROR_WANT_WRITE
orSSL_ERROR_WANT_READ
in any case the underlying socket returns EAGAIN, but that's not an assumption we should be making on the return value. OpenSSL has its own buffers and we can only retry when it tells us to. This also moves the translation of EPIPE to the raw socket, but also special cases it intrilogy_syserr_fail_str
for backwards compatibility.client.c
should be threating the socket like a black box not be assuming that it is backed by system calls.Next this fixes checks on the return value from
SSL_read
andSSL_write
. A return value of 0 is an error. Previously we incorrectly assumed that return value held the same meaning as fromread
orwrite
, which is incorrect. Usually this happened on an unexpected EOF from the socket (which we'd happen to treat correctly, assuming the same meaning asread
), but looking through the OpenSSL source 0 can mean a number of other errors. Instead this now adds a check for the case described in the SSL_get_error's man page "BUG" section.Additionally this clears any SSL errors inside
rb_trilogy_close
. We don't check whethertrilogy_close_send
/trilogy_close_recv
return an error, which is (probably) fine: we're closing the socket, but we must clear any OpenSSL errors which are left on the stack from that read/write operation.Finally, this adds a call to
ERR_clear_error()
before every OpenSSL operation. This is a workaround due to other libraries which incorrectly leave unhandled errors in the global (thread-local) OpenSSL error queue. This is a workaround other libraries also seem to have adopted (ones I know of: libpq and libcurl, the later of which ironically is responsible for causing this problem).I would be happy to not look at OpenSSL for a while after this 😅
cc @adrianna-chang-shopify @casperisfine @composerinteralia @matthewd
Special thanks to @pudiva for the deep investigation into the source of the unhandled errors on the SSL queue ❤️. Which revealed that some of the errors we were seeing were due to mistakes in Trilogy (insufficiently shutting down our connections after errors) and others were from other libraries (specifically libcurl, requiring clearing the queue before attempting reads/writes).