-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Long-lived application occasionally dies with SIG_PIPE when calling httpclient.request #9867
Comments
huh, it's SIGPIPE'ing on a Anyway, please follow the advice here: https://stackoverflow.com/a/108192/492186. As a workaround you can ignore all SIGPIPEs in your program, as a proper fix I guess we need to set some OpenSSL flag or something. |
Heya dom96, I tried that, like so: from posix import signal, SIG_PIPE, SIG_IGN
signal(SIG_PIPE, SIG_IGN) It seemed to have worked for a good while, but today it fell over with this (one exception per line):
All future requests then also failed on the assert |
Hrm, would be nice to get stack traces for those exceptions. I'm guessing
httpclient doesn't handle the socket getting closed correctly.
…On Tue, 11 Dec 2018 at 09:12, niv ***@***.***> wrote:
Heya dom96, I tried that, like so:
from posix import signal, SIG_PIPE, SIG_IGNsignal(SIG_PIPE, SIG_IGN)
It seemed to have worked for a good while, but today it fell over with
this (one exception per line):
Broken pipe Additional info: IO error has occurred in the BIO layer; will retry
not isClosed(socket) Cannot `send` on a closed socket; will retry
Broken pipe Additional info: IO error has occurred in the BIO layer; not retrying anymore
All future requests then also failed on the assert not isClosed and I had
to bump the process to get functionality back.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#9867 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAPDe--Q2uRyV0armpnf40XAprhuK5nDks5u33b6gaJpZM4ZCheN>
.
|
Ah, sorry. Don't have those anymore. I've also added workaround code now that force-closes the httpclient after reach request chain, so that the production process doesn't hang itself. |
This is a double-close issue that I'm also seeing in production. It's not an OpenSSL issue, as the error sometimes is On The code on which the error occurs looks like this: var httpClient = newAsyncHttpClient(userAgent = ctx.userAgent)
# headers omitted
let respFuture = httpClient.request(ctx.endpoint, httpMethod = HttpPOST, body = reqBody)
yield respFuture
if respFuture.failed:
httpClient.close() # Double close here!
raise respFuture.readError() A simple fix would be to add |
So apparently the scenario is this:
So either socket.close() should be made to never raise or else any code that calls it must handle the exceptions. If it raises, it should clean up the state as much as possible (e.g. set SSL_shutdown can be made to always succeed by calling SSL_CTX_set_quiet_shutdown, according to the SSL_shutdown documentation. |
Another important line from SSL_shutdown() doc:
So we should also track whether a fatal error occurred on the connection and not call SSL_shutdown in such case. If the above condition is violated, SSL_shutdown may raise even if SSL_CTX_set_quiet_shutdown() was called. |
This implies that the original exception raised by SslShutdown is getting caught somewhere, if so where? |
It results in a failed future returned from request(). Regardless, AsyncHttpClient.close() is expected to be always safe to call, and here it gets in a state where calling it results in a crash or corrupted memory. |
@dom96 Here is a stack trace for this issue with nim 1.0.4 and SIG_PIPE ignored (per niv's comments):
|
This (closing httpclient) is causing my web service with thousands of users to occasionally crash, sometimes more often than others. I tried @endragor's suggestions but it didn't seem to help. Error messages like this:
|
I'd give it another try after my openssl changes (#13919 ), this seems to be somewhat related so it's possible it was fixed. |
That PR doesn't seem to affect this bug. The bug is not about OpenSSL, it's about Nim code on higher level - namely:
See my old comments here - I pinpointed where the problems are. |
Another error message: |
I have a Django / Gunicorn / Nginx server and decided to write some speed critical parts in Nim. I created the .so files with Nimpy and after testing them thoroughly they seem to work well. However, when I use them as part of the server loop in production, I often get a I was looking for information about this problem which I attributed to Python at first and ended up here. So I removed all the Nim code from the server and now it runs without issues. Therefore I guess this is the same issue you are mentioning in this thread but in this case the server is not written in Nim. In fact I just have a couple of Nim files with one (very simple) function each. Please let me know if you want to see some specific logs / stack traces to address this issue. |
Oh, yes, please! |
I've simplified the stack to test it in an easier way. I've removed nginx and gunicorn and run the Python app directly in Django's development server. Also I've removed all calls to nim generated .so modules from the Python code with the exception of a single function. With this arrangement the server keeps crashing. An it does so very early, either on page load or after interacting with the site once or twice. To get some stacktraces, the first thing I tried is to wrap the Python function that calls the nim .so in try-catches with no luck. Server execution stops with a "SIGPIPE: Pipe closed" message and I am not able to catch and/or print any error. This is an example of the Python development server output:
Then I tried to compile the nim code with the stackTrace flag:
In these case I get the following output from the server before crashing (instead of the SIGPIPE: Pipe closed):
Where run_local.sh is just a bash script that compiles webpack and starts the Django development server with "python manage.py runserver". So again, not much information. If you can tell me which specific logs / stack traces you'd like to see and how to get them I'm happy to provide them. |
@tomaso909 trying to minimize your Nim code to be able to share it would really help a lot, or also trying to replicate it without Django (maybe by using the .so file from another Nim program) |
The nim code is very small already. Here I attach the
This function simply transforms a vector. It does not do anything network related. |
@tomaso909 I can replicate it with a simple Django app, will try to understand why that happens, thanks for the info :) |
@tomaso909 I think it might be not related to Nim at all - maybe it's the Django's fault really? I can reproduce it when using |
@Yardanico I see your point. But if I do not import the Nim function in Django the server works without problems. So there is an interaction there that causes the error. I posted the problem here because it looks the same you have in pure Nim servers and I was not able to find information for Django on this exact error. |
@Yardanico @Araq If my case is interesting to you I can prepare a small reproducible example with a very minimal Django server. On the other hand, if you think this is more of a Django issue (or Nimpy perhaps) I can ask elsewhere. |
This is not a Django issue. The cause of this bug is described in #9867 (comment) With a bit effort you can simulate it by raising error on a call to socket.close(). https://github.com/nim-lang/Nim/blob/devel/lib/pure/asyncnet.nim#L721 - replace this line with raising the exception unconditionally and then call socket.close() twice. That is the essence of the problem. The second call results in corrupted memory, because SslShutdown and SSLFree are called second time on the pointers that are already freed. There is The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs. For example, you may call wrapSocket() on a socket that was already wrapped and previous pointers will be leaked. Thorough review of the API is necessary, with exceptional cases being documented and handled where necessary. At the moment I wouldn't suggest using these APIs in production applications - they are ridden with bugs even if you use them correctly, and you may not get a timely error when you misuse them. |
This IMO qualifies this as a showstopper. I'll let Araq change it if he disagress but otherwise we could use some support to fix it. |
I am using ssl + asynchttpclient and getting a segfault from a long-running application, last one was after 2 days.
Fair to assume that is a symptom of the underlying issue here? There's some mentions of possible work on this, I used 1.3.5 devel (don't have the git hash, but can re-compile with another release and debugging/stacktracing on). |
Please recompile with stack trace on. My guess would be that |
Also I'd recommend that everyone running their Nim program as a SSL server should install |
Will do |
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. in defer blocks). Ref nim-lang#9867
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. in defer blocks). Ref nim-lang#9867
Workaround nim nim-lang/Nim#9867
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. the other end of the pipe is closed before shutdown can be negotiated). Ref nim-lang#9867
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. the other end of the pipe is closed before shutdown can be negotiated). Ref nim-lang#9867
…15066) * asyncnet, net: don't attempt SSL_shutdown if a fatal error occurred Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. the other end of the pipe is closed before shutdown can be negotiated). Ref #9867 * tssl: try sending until an error occur * tssl: cleanup * tssl: actually run the test I forgot to make the test run :P * tssl: run the test on ARC, maybe then it'll be happy * tssl: turns off ARC, switch tlsEmulation on for freebsd * tssl: document why tlsEmulation is employed * net: move SafeDisconn handling logic to socketError
don't mean to necro, but the reason I haven't posted anything is, since that message, is because I had pulled latest devel, and recompiled with stack tracing on etc... and it hasn't crashed since. I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward. |
Please do :) And thanks for letting me know that previous patches does manage to prevent crashes in certain configurations. With the latest fix, at least for the kind of errors in the first post, is fully resolved. However only in real-world testing would we know if this is the case 100% of the time. |
Seems to be segfaulting on something else now, after using latest nightly compiler
gdb coredump backtrace doesn't yield anything useful, I may try another nightly, as this appears a seperate problem, but still adding here for posterity and in case I'm mistaken. |
@D-Nice can you try with --gc:arc ? |
@Yardanico the project imports EDIT: Forgot to mention, it just crashed again after 3ish hours. |
I am attempting other GCs, and have previously run this app with boehm, without issues, but on latest devel, boehm now crashes very shortly after initialization. First crash with boehm, compressed bin with upx
Second crash with boehm, uncompressed bin
gdb backtrace of second crash
The latest nightly I'm referring to is:
|
As a note, similar crash still in boehm, with 1.2.6. I could swear I ran it fine with some earlier nim 1.2.x, I'll investigate more there. markAndSweep won't compile at all with nightly for my app, getting Running it for now with 1.2.6 and M&S gc to see how it fares over time. |
This is interesting. Can you enable debug information (with Also I'm seeing
in one of your logs. This is really odd as this bug should have been completely fixed in devel. Can you confirm your version with |
For that SSL error you've mentioned, I can confirm it was I'm looking to create a minimal reproducible example, and see if I can get same issues going there. |
Compiled with recent devel: Compile flags for the code: backtrace with native debugger flag:
logs
|
This seems to be caused by a double close. I've PR-ed a (hack?) fix in #15174. A proper fix will be done once I got some insight into some weird design within |
…im-lang#15066) * asyncnet, net: don't attempt SSL_shutdown if a fatal error occurred Per TLS standard and SSL_shutdown(3ssl). This should prevent errors coming from a close() after a bad event (ie. the other end of the pipe is closed before shutdown can be negotiated). Ref nim-lang#9867 * tssl: try sending until an error occur * tssl: cleanup * tssl: actually run the test I forgot to make the test run :P * tssl: run the test on ARC, maybe then it'll be happy * tssl: turns off ARC, switch tlsEmulation on for freebsd * tssl: document why tlsEmulation is employed * net: move SafeDisconn handling logic to socketError
Our serverside application, currently on 0.18, does http the occasional http request to the Google APIs. Sometimes, I think, google closes the connection on us and the application dies with SIG_PIPE:
Of course, the http request() is wrapped into try/except; that does not catch the signal however.
This is running on nim 0.18 (I can't try 0.19 yet, porting it is significant effort due to libraries used), and in a docker container on Linux.
I found this thread: https://forum.nim-lang.org/t/4417, but that didn't really help. The httpclient is already configured to use a reasonable timeout (3s).
I'm fairly certain this is because Google is closing the kept-alive connection in the background; but I'm hesitant to add
Connection: Close
(or otherwise close the connection on our end) unless that's the only way to solve this.The text was updated successfully, but these errors were encountered: