Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long-lived application occasionally dies with SIG_PIPE when calling httpclient.request #9867

Closed
niv opened this issue Dec 5, 2018 · 40 comments · Fixed by #15120 or #15174
Closed

Long-lived application occasionally dies with SIG_PIPE when calling httpclient.request #9867

niv opened this issue Dec 5, 2018 · 40 comments · Fixed by #15120 or #15174

Comments

@niv
Copy link
Contributor

niv commented Dec 5, 2018

Our serverside application, currently on 0.18, does http the occasional http request to the Google APIs. Sometimes, I think, google closes the connection on us and the application dies with SIG_PIPE:

...
keyprovision_google.nim(66) getAccessToken
keyprovision_google.nim(30) requestJson
keyprovision_google.nim(71) :anonymous
httpclient.nim(1196)     request
httpclient.nim(1181)     requestAux
httpclient.nim(1031)     parseResponse
httpclient.nim(883)      close
net.nim(866)             close
SIGPIPE: Pipe closed.

Of course, the http request() is wrapped into try/except; that does not catch the signal however.

This is running on nim 0.18 (I can't try 0.19 yet, porting it is significant effort due to libraries used), and in a docker container on Linux.

I found this thread: https://forum.nim-lang.org/t/4417, but that didn't really help. The httpclient is already configured to use a reasonable timeout (3s).

I'm fairly certain this is because Google is closing the kept-alive connection in the background; but I'm hesitant to add Connection: Close (or otherwise close the connection on our end) unless that's the only way to solve this.

@dom96
Copy link
Contributor

dom96 commented Dec 5, 2018

huh, it's SIGPIPE'ing on a close. Interesting, it appears to be OpenSSL's fault: https://github.com/nim-lang/Nim/blob/v0.18.0/lib/pure/net.nim#L866.

Anyway, please follow the advice here: https://stackoverflow.com/a/108192/492186. As a workaround you can ignore all SIGPIPEs in your program, as a proper fix I guess we need to set some OpenSSL flag or something.

@niv
Copy link
Contributor Author

niv commented Dec 11, 2018

Heya dom96, I tried that, like so:

from posix import signal, SIG_PIPE, SIG_IGN
signal(SIG_PIPE, SIG_IGN)

It seemed to have worked for a good while, but today it fell over with this (one exception per line):

Broken pipe Additional info: IO error has occurred in the BIO layer; will retry
not isClosed(socket) Cannot `send` on a closed socket; will retry
Broken pipe Additional info: IO error has occurred in the BIO layer; not retrying anymore

All future requests then also failed on the assert not isClosed and I had to bump the process to get functionality back.

@dom96
Copy link
Contributor

dom96 commented Dec 11, 2018 via email

@niv
Copy link
Contributor Author

niv commented Dec 11, 2018

Ah, sorry. Don't have those anymore.

I've also added workaround code now that force-closes the httpclient after reach request chain, so that the production process doesn't hang itself.

@endragor
Copy link
Contributor

This is a double-close issue that I'm also seeing in production. It's not an OpenSSL issue, as the error sometimes is not (pkey.ident == 0) Descriptor 26 is not registered in the selector! and sometimes it's an OpenSSL crash, caused by double-free, like in the OP's case. Ignoring SIGPIPE or other such signals is not an option as the app will continue working in a corrupted state, which you do not want - better to crash early and restart.

On Socket/AsyncSocket level there is no checking for whether the socket was already closed, and it seems that under some condition httpclient.request may close the socket after a failure occurred without setting connected field to false. By glancing at the code I'm not yet sure where exactly this happens.

The code on which the error occurs looks like this:

var httpClient = newAsyncHttpClient(userAgent = ctx.userAgent)
# headers omitted
let respFuture = httpClient.request(ctx.endpoint, httpMethod = HttpPOST, body = reqBody)
yield respFuture
if respFuture.failed:
  httpClient.close() # Double close here!
  raise respFuture.readError()

A simple fix would be to add connected field to Socket/AsyncSocket objects to avoid any double-close issues. But the httpclient issue should also be fixed.

@endragor
Copy link
Contributor

endragor commented Mar 18, 2019

So apparently the scenario is this:

  1. Server disconnects, leaving the client's underlying OS socket in disconnected state.
  2. AsyncHttpClient calls socket.close(), where SslShutdown errors out (due to the disconnected socket) and so close() raises. This leaves AsyncHttpClient in connected state, since it doesn't handle exceptions from socket.close().
  3. close() is called again and SslShutdown/SSLFree is called for the second time on invalid pointer, corrupting the memory.

So either socket.close() should be made to never raise or else any code that calls it must handle the exceptions. If it raises, it should clean up the state as much as possible (e.g. set sslHandle to nil after freeing it and not use it if it's already nil).

SSL_shutdown can be made to always succeed by calling SSL_CTX_set_quiet_shutdown, according to the SSL_shutdown documentation.

@endragor
Copy link
Contributor

Another important line from SSL_shutdown() doc:

Note that SSL_shutdown() must not be called if a previous fatal error has occurred on a connection i.e. if SSL_get_error() has returned SSL_ERROR_SYSCALL or SSL_ERROR_SSL.

So we should also track whether a fatal error occurred on the connection and not call SSL_shutdown in such case. If the above condition is violated, SSL_shutdown may raise even if SSL_CTX_set_quiet_shutdown() was called.

@Araq Araq added the Severe label Mar 19, 2019
@dom96
Copy link
Contributor

dom96 commented Mar 20, 2019

close() is called again and SslShutdown/SSLFree

This implies that the original exception raised by SslShutdown is getting caught somewhere, if so where?

@endragor
Copy link
Contributor

It results in a failed future returned from request(). Regardless, AsyncHttpClient.close() is expected to be always safe to call, and here it gets in a state where calling it results in a crash or corrupted memory.

@jorisbontje
Copy link

@dom96 Here is a stack trace for this issue with nim 1.0.4 and SIG_PIPE ignored (per niv's comments):

Got exception ref 0x7f0485be2098 --> [errorCode = 32,
parent = nil,
name = 0x5597fa08ce9a"OSError",
msg = 0x7f0485be12d8"Broken pipe\10"
"Additional info: \"IO error has occurred in the BIO layer\"",
trace = 0x7f04841c6048@[[procname = 0x5597fa0956f5"eos_watcher_rpc",
line = 239,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa095673"main",
line = 206,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa095468"get_block",
line = 174,
filename = 0x5597fa095328"/root/eos-watcher/eos_watcher_rpc.nim"], [procname = 0x5597fa09157b"request",
line = 932,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09157b"request",
line = 907,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09154a"requestAux",
line = 892,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09150a"parseResponse",
line = 770,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa091500"parseBody",
line = 678,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa09140e"recvFull",
line = 608,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa0911a7"close",
line = 563,
filename = 0x5597fa091110"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim"], [procname = 0x5597fa08f3bb"close",
line = 900,
filename = 0x5597fa08f118"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim"], [procname = 0x5597fa08f2d9"socketError",
line = 735,
filename = 0x5597fa08f118"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim"], [procname = 0x5597fa08cea2"raiseOSError",
line = 94,
filename = 0x5597fa08cd98"/root/.choosenim/toolchains/nim-1.0.4/lib/pure/includes/oserr.nim"]],
raiseId = 0,
up = ref 0x7f0485be2098]
 with message Broken pipe
Additional info: "IO error has occurred in the BIO layer"
Sleeping for 60 seconds...
/root/eos-watcher/eos_watcher_rpc.nim(239) eos_watcher_rpc
/root/eos-watcher/eos_watcher_rpc.nim(206) main
/root/eos-watcher/eos_watcher_rpc.nim(174) get_block
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(932) request
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(907) request
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/httpclient.nim(886) requestAux
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1349) send
/root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1334) send
/root/.choosenim/toolchains/nim-1.0.4/lib/system/assertions.nim(27) failedAssertImpl
/root/.choosenim/toolchains/nim-1.0.4/lib/system/assertions.nim(20) raiseAssert
/root/.choosenim/toolchains/nim-1.0.4/lib/system/fatal.nim(39) sysFatal
Error: unhandled exception: /root/.choosenim/toolchains/nim-1.0.4/lib/pure/net.nim(1334, 9) `not socket.isClosed` Cannot `send` on a closed socket [AssertionError]
Error: execution of an external program failed: '/root/eos-watcher/eos_watcher_rpc '

@zedeus
Copy link
Contributor

zedeus commented Apr 7, 2020

This (closing httpclient) is causing my web service with thousands of users to occasionally crash, sometimes more often than others. I tried @endragor's suggestions but it didn't seem to help. Error messages like this:

*** Error in `./nitter': free(): invalid pointer: 0x0000558d1c889a00 ***
*** Error in `./nitter': double free or corruption (!prev): 0x000055b56f28ef20 ***
INFO Jester is making jokes at http://0.0.0.0:5000
Traceback (most recent call last)
/home/zed/src/dev/nim/nitter/src/nitter.nim(33) nitter
/home/zed/src/dev/nim/jester/jester.nim(508) serve
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1879) runForever
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1574) poll
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(1348) runOnce
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncdispatch.nim(210) processPendingCallbacks
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/asyncmacro.nim(34) fetchHtmlNimAsyncContinue
/home/zed/src/dev/nim/nitter/src/api/utils.nim(34) fetchHtmlIter
/home/zed/.choosenim/toolchains/nim-#devel/lib/pure/httpclient.nim(580) close
/home/zed/.choosentm/toolchains/nim-#devel/lib/pure/asyncnet.nim(697) close
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

@dom96
Copy link
Contributor

dom96 commented Apr 16, 2020

I'd give it another try after my openssl changes (#13919 ), this seems to be somewhat related so it's possible it was fixed.

@endragor
Copy link
Contributor

That PR doesn't seem to affect this bug. The bug is not about OpenSSL, it's about Nim code on higher level - namely:

  1. AsyncHttpClient
  2. Usage of SSL_shutdown - it may error out in situations which are expected for the code that uses it, so it should be handled.

See my old comments here - I pinpointed where the problems are.

@zedeus
Copy link
Contributor

zedeus commented Apr 18, 2020

Another error message:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init

@tomaso909
Copy link

I have a Django / Gunicorn / Nginx server and decided to write some speed critical parts in Nim. I created the .so files with Nimpy and after testing them thoroughly they seem to work well. However, when I use them as part of the server loop in production, I often get a SIGPIPE: Pipe closed and the server dies (ignoring any try - catch).

I was looking for information about this problem which I attributed to Python at first and ended up here. So I removed all the Nim code from the server and now it runs without issues. Therefore I guess this is the same issue you are mentioning in this thread but in this case the server is not written in Nim. In fact I just have a couple of Nim files with one (very simple) function each.

Please let me know if you want to see some specific logs / stack traces to address this issue.

@Araq
Copy link
Member

Araq commented May 7, 2020

Please let me know if you want to see some specific logs / stack traces to address this issue.

Oh, yes, please!

@tomaso909
Copy link

I've simplified the stack to test it in an easier way. I've removed nginx and gunicorn and run the Python app directly in Django's development server. Also I've removed all calls to nim generated .so modules from the Python code with the exception of a single function. With this arrangement the server keeps crashing. An it does so very early, either on page load or after interacting with the site once or twice.

To get some stacktraces, the first thing I tried is to wrap the Python function that calls the nim .so in try-catches with no luck. Server execution stops with a "SIGPIPE: Pipe closed" message and I am not able to catch and/or print any error. This is an example of the Python development server output:

System check identified no issues (0 silenced).
May 09, 2020 - 15:34:14
Django version 1.10.8, using settings 'D3Tools.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
[09/May/2020 15:34:16] "GET /accounts/logout/ HTTP/1.1" 200 4699
[09/May/2020 15:34:17] "GET /static/accounts/js/main.js HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/css/main.css HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/water/css/style.css HTTP/1.1" 304 0
[09/May/2020 15:34:17] "GET /static/accounts/images/d3_logo_horizontal_white.png HTTP/1.1" 200 4964
[Server|login]: User truiz login successfull.
[09/May/2020 15:34:25] "POST /accounts/login/ HTTP/1.1" 302 0
[09/May/2020 15:34:25] "GET /accounts/ HTTP/1.1" 200 4960
SIGPIPE: Pipe closed.

Then I tried to compile the nim code with the stackTrace flag:

nim c --threads:on --app:lib --profiler:on --stackTrace:on --out:transformations.so transformations

In these case I get the following output from the server before crashing (instead of the SIGPIPE: Pipe closed):

./GUI/scripts/run_local.sh: line 11: 23939 Segmentation fault (core dumped) python manage.py runserver

Where run_local.sh is just a bash script that compiles webpack and starts the Django development server with "python manage.py runserver". So again, not much information.

If you can tell me which specific logs / stack traces you'd like to see and how to get them I'm happy to provide them.

@ghost
Copy link

ghost commented May 9, 2020

@tomaso909 trying to minimize your Nim code to be able to share it would really help a lot, or also trying to replicate it without Django (maybe by using the .so file from another Nim program)

@tomaso909
Copy link

The nim code is very small already. Here I attach the transformations.nim file. Just by calling the vectorEulerRotation function from Python I already get the SIG_PIPE errors. I've tested the function in Python without using Django and it works perfectly.

import nimpy
import math


type
    Vector3 = array[3, float]

proc vectorEulerRotation(vector: Vector3, yawPitchRoll: Vector3, reverse: bool): Vector3 {.exportpy.} =

    let yaw: float = degToRad(yawPitchRoll[0])
    let pitch: float = degToRad(yawPitchRoll[1])
    let roll: float = degToRad(yawPitchRoll[2])
    
    let cosYaw: float = cos(yaw)
    let sinYaw: float = sin(yaw)
    let cosPitch: float = cos(pitch)
    let sinPitch: float = sin(pitch)
    let cosRoll: float = cos(roll)
    let sinRoll: float = sin(roll)
    
    let matrix_00: float = cosYaw*cosPitch
    let matrix_01: float = cosYaw*sinPitch*sinRoll-sinYaw*cosRoll
    let matrix_02: float = cosYaw*sinPitch*cosRoll+sinYaw*sinRoll
    let matrix_10: float = sinYaw*cosPitch
    let matrix_11: float = sinYaw*sinPitch*sinRoll+cosYaw*cosRoll
    let matrix_12: float = sinYaw*sinPitch*cosRoll-cosYaw*sinRoll
    let matrix_20: float = -sinPitch
    let matrix_21: float = cosPitch*sinRoll
    let matrix_22: float = cosPitch*cosRoll
    
    if reverse:
        let rotated: Vector3 = [
            matrix_00*vector[0]+matrix_10*vector[1]+matrix_20*vector[2],
            matrix_01*vector[0]+matrix_11*vector[1]+matrix_21*vector[2],
            matrix_02*vector[0]+matrix_12*vector[1]+matrix_22*vector[2]
        ]
        return rotated
    else:
        let rotated: Vector3 = [
            matrix_00*vector[0]+matrix_01*vector[1]+matrix_02*vector[2],
            matrix_10*vector[0]+matrix_11*vector[1]+matrix_12*vector[2],
            matrix_20*vector[0]+matrix_21*vector[1]+matrix_22*vector[2]
        ]
        return rotated

This function simply transforms a vector. It does not do anything network related.

@ghost
Copy link

ghost commented May 9, 2020

@tomaso909 I can replicate it with a simple Django app, will try to understand why that happens, thanks for the info :)

@ghost
Copy link

ghost commented May 9, 2020

@tomaso909 I think it might be not related to Nim at all - maybe it's the Django's fault really? I can reproduce it when using wrk to stress-test Django and then exit wrk with Ctrl+C, maybe it's Django who's sending those to the script or something?

@tomaso909
Copy link

tomaso909 commented May 10, 2020

@Yardanico I see your point. But if I do not import the Nim function in Django the server works without problems. So there is an interaction there that causes the error.

I posted the problem here because it looks the same you have in pure Nim servers and I was not able to find information for Django on this exact error.

@tomaso909
Copy link

@Yardanico @Araq If my case is interesting to you I can prepare a small reproducible example with a very minimal Django server. On the other hand, if you think this is more of a Django issue (or Nimpy perhaps) I can ask elsewhere.
Let me know. I'm loving Nim and I'm happy to help.

@endragor
Copy link
Contributor

This is not a Django issue. The cause of this bug is described in #9867 (comment)

With a bit effort you can simulate it by raising error on a call to socket.close(). https://github.com/nim-lang/Nim/blob/devel/lib/pure/asyncnet.nim#L721 - replace this line with raising the exception unconditionally and then call socket.close() twice. That is the essence of the problem. The second call results in corrupted memory, because SslShutdown and SSLFree are called second time on the pointers that are already freed. There is socket.closed = true, but it is not getting checked.
The actual user code may have this problem even without calling close() twice - when you use AsyncHttpClient, it closes the socket internally, and you also need to close() the AsyncHttpClient itself. So even if you have single close() in your code, and the code itself is correct, there is a corruption. And anyway, worst case should be you get an exception when you misuse an API, not a corruption.

The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs. For example, you may call wrapSocket() on a socket that was already wrapped and previous pointers will be leaked. Thorough review of the API is necessary, with exceptional cases being documented and handled where necessary. At the moment I wouldn't suggest using these APIs in production applications - they are ridden with bugs even if you use them correctly, and you may not get a timely error when you misuse them.

@dom96
Copy link
Contributor

dom96 commented May 21, 2020

The AsyncSocket and SSL APIs in Nim are in general very fragile and easy to misuse, creating leaks and other bugs.

This IMO qualifies this as a showstopper. I'll let Araq change it if he disagress but otherwise we could use some support to fix it.

@Araq Araq added this to the 1.4.0 milestone Jun 1, 2020
@D-Nice
Copy link
Contributor

D-Nice commented Jun 26, 2020

I am using ssl + asynchttpclient and getting a segfault from a long-running application, last one was after 2 days.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                                                       
(gdb) bt                                                                                                                                                                                                         
#0  0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                        
#1  0x00007f10582f133e in ssl_read_internal () from /lib64/libssl.so.1.1
#2  0x00007f10582f1447 in SSL_read () from /lib64/libssl.so.1.1                                                                                                                                                  
#3  0x00000000004278af in ?? ()                                                                                                                                                                                  
#4  0x0000000000427b63 in ?? ()                                                                                                                                                                                  
#5  0x0000000000427f08 in ?? ()                                                                                                                                                                                  
#6  0x0000000000428057 in ?? ()                                                                                                                                                                                  
#7  0x000000000042811a in ?? ()                                                                                                                                                                                  
#8  0x000000000042847c in ?? ()
#9  0x000000000042fd19 in ?? ()                                                                                                                                                                                  
#10 0x000000000042fde1 in ?? ()                                                                                                                                                                                  
#11 0x000000000043013b in ?? ()                                                                                                                                                                                  
#12 0x0000000000431bfd in ?? ()                                                                                                                                                                                  
#13 0x0000000000431f88 in ?? ()                                                                                                                                                                                  
#14 0x00000000004210d9 in ?? ()                                                                                                                                                                                  
#15 0x000000000042168e in ?? ()                                                                                                                                                                                  
#16 0x0000000000458b6a in ?? ()                                                                                                                                                                                  
#17 0x00007f1058e9fe9c in __run_exit_handlers () from /lib64/libc.so.6                                                                                                                                           
#18 0x00007f1058e9ffd0 in exit () from /lib64/libc.so.6                                                                                                                                                          
#19 0x0000000000406f9a in ?? ()                                                                                                                                                                                  
#20 <signal handler called>                                                                                                                                                                                      
#21 0x00007f10582e61bd in ssl3_read_internal () from /lib64/libssl.so.1.1                                                                                                                                        
#22 0x00007f10582f133e in ssl_read_internal () from /lib64/libssl.so.1.1                                                                                                                                         
#23 0x00007f10582f1447 in SSL_read () from /lib64/libssl.so.1.1

Fair to assume that is a symptom of the underlying issue here?

There's some mentions of possible work on this, I used 1.3.5 devel (don't have the git hash, but can re-compile with another release and debugging/stacktracing on).

@alaviss
Copy link
Collaborator

alaviss commented Jun 26, 2020

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

@alaviss
Copy link
Collaborator

alaviss commented Jun 26, 2020

Also I'd recommend that everyone running their Nim program as a SSL server should install SIG_IGN as the SIGPIPE handler (actually we might want to do this in the stdlib). SIGPIPE is a pretty old signal and is largely obsoleted by EPIPE that's returned from any failed send()/write() attempt on a closed socket/pipe (we don't get to handle this one because SIGPIPE got in the way).

@D-Nice
Copy link
Contributor

D-Nice commented Jun 26, 2020

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

Will do

alaviss added a commit to alaviss/Nim that referenced this issue Jun 26, 2020
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. in defer blocks).

Ref nim-lang#9867
alaviss added a commit to alaviss/Nim that referenced this issue Jun 26, 2020
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. in defer blocks).

Ref nim-lang#9867
yglukhov added a commit to yglukhov/async_http_request that referenced this issue Jul 21, 2020
yglukhov added a commit to yglukhov/async_http_request that referenced this issue Jul 21, 2020
alaviss added a commit to alaviss/Nim that referenced this issue Jul 25, 2020
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. the other end of the pipe
is closed before shutdown can be negotiated).

Ref nim-lang#9867
alaviss added a commit to alaviss/Nim that referenced this issue Jul 25, 2020
Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. the other end of the pipe
is closed before shutdown can be negotiated).

Ref nim-lang#9867
Araq pushed a commit that referenced this issue Jul 29, 2020
…15066)

* asyncnet, net: don't attempt SSL_shutdown if a fatal error occurred

Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. the other end of the pipe
is closed before shutdown can be negotiated).

Ref #9867

* tssl: try sending until an error occur

* tssl: cleanup

* tssl: actually run the test

I forgot to make the test run :P

* tssl: run the test on ARC, maybe then it'll be happy

* tssl: turns off ARC, switch tlsEmulation on for freebsd

* tssl: document why tlsEmulation is employed

* net: move SafeDisconn handling logic to socketError
@D-Nice
Copy link
Contributor

D-Nice commented Aug 1, 2020

Please recompile with stack trace on. My guess would be that SSL_shutdown() was called on a closed socket, which will generates SIGPIPE if it attempts to negotiate a shutdown with the client and the socket is closed (which will be true most of the time). I can do a fix for this case.

Will do

don't mean to necro, but the reason I haven't posted anything is, since that message, is because I had pulled latest devel, and recompiled with stack tracing on etc... and it hasn't crashed since. I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

@alaviss
Copy link
Collaborator

alaviss commented Aug 1, 2020

I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

Please do :) And thanks for letting me know that previous patches does manage to prevent crashes in certain configurations. With the latest fix, at least for the kind of errors in the first post, is fully resolved. However only in real-world testing would we know if this is the case 100% of the time.

@D-Nice
Copy link
Contributor

D-Nice commented Aug 4, 2020

I will probably recompile with the currently latest devel, and notify back if it runs into issues going forward.

Please do :) And thanks for letting me know that previous patches does manage to prevent crashes in certain configurations. With the latest fix, at least for the kind of errors in the first post, is fully resolved. However only in real-world testing would we know if this is the case 100% of the time.

Seems to be segfaulting on something else now, after using latest nightly compiler

Aug 04 15:26:36 *hostname* app[356277]: DBG 2020-08-04 15:26:36-04:00 Blocking main thread for future resolution tid=356277                                                                                 
Aug 04 15:26:44 *hostname* app[356277]: Traceback (most recent call last)
Aug 04 15:26:44 *hostname* app[356277]: my_app.nim my_app
Aug 04 15:26:44 *hostname* app[356277]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 15:26:44 *hostname* app[356277]: cligen.nim               dispatchapp
Aug 04 15:26:44 *hostname* app[356277]: my_app.nim app
Aug 04 15:26:44 *hostname* app[356277]: log_output.nim           initLogRecord
Aug 04 15:26:44 *hostname* app[356277]: terminal.nim             ansiForegroundColorCode
Aug 04 15:26:44 *hostname* app[356277]: terminal.nim             ansiStyleCode
Aug 04 15:26:44 *hostname* app[356277]: gc.nim                   newObjNoInit
Aug 04 15:26:44 *hostname* app[356277]: alloc.nim                rawAlloc
Aug 04 15:26:44 *hostname* app[356277]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 15:26:51 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

gdb coredump backtrace doesn't yield anything useful, I may try another nightly, as this appears a seperate problem, but still adding here for posterity and in case I'm mistaken.

@ghost
Copy link

ghost commented Aug 4, 2020

@D-Nice can you try with --gc:arc ?

@D-Nice
Copy link
Contributor

D-Nice commented Aug 4, 2020

@Yardanico the project imports chronicles package, which has a dependency on chronos, which apparently won't compile with arc/orc due to some unsupported cast from chronos.

EDIT: Forgot to mention, it just crashed again after 3ish hours.

@D-Nice
Copy link
Contributor

D-Nice commented Aug 5, 2020

I am attempting other GCs, and have previously run this app with boehm, without issues, but on latest devel, boehm now crashes very shortly after initialization.

First crash with boehm, compressed bin with upx

Aug 04 19:52:49 *hostname* app[371851]: ERR 2020-08-04 19:52:49-04:00 Main event:                                tid=371851                                                                                    
Aug 04 19:52:49 *hostname* app[371851]: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init                                                                                                        
Aug 04 19:52:49 *hostname* app[371851]: Traceback (most recent call last)
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim my_app
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               dispatchapp
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim app
Aug 04 19:52:49 *hostname* app[371851]: DBG 2020-08-04 19:52:49-04:00 Making Request                             tid=371851
Aug 04 19:52:49 *hostname* app[371851]: Traceback (most recent call last)
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim my_app
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:52:49 *hostname* app[371851]: cligen.nim               dispatchapp
Aug 04 19:52:49 *hostname* app[371851]: my_app.nim app
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        waitFor
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        poll
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        runOnce
Aug 04 19:52:49 *hostname* app[371851]: asyncdispatch.nim        processPendingCallbacks
Aug 04 19:52:49 *hostname* app[371851]: asyncmacro.nim           sendNimAsyncContinue
Aug 04 19:52:49 *hostname* app[371851]: asyncmacro.nim           sendIter
Aug 04 19:52:49 *hostname* app[371851]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 19:52:50 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

Second crash with boehm, uncompressed bin

Aug 04 19:58:59 *hostname* app[371964]: DBG 2020-08-04 19:58:59-04:00 Blocking main thread for future resolution tid=371964                                                                                    
Aug 04 19:59:11 *hostname* app[371964]: Traceback (most recent call last)
Aug 04 19:59:11 *hostname* app[371964]: my_app.nim my_app
Aug 04 19:59:11 *hostname* app[371964]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 04 19:59:11 *hostname* app[371964]: cligen.nim               dispatchapp
Aug 04 19:59:11 *hostname* app[371964]: my_app.nim app
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        waitFor
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        poll
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        runOnce
Aug 04 19:59:11 *hostname* app[371964]: asyncdispatch.nim        processPendingCallbacks
Aug 04 19:59:11 *hostname* app[371964]: asyncmacro.nim           sendNimAsyncContinue
Aug 04 19:59:11 *hostname* app[371964]: asyncmacro.nim           sendIter
Aug 04 19:59:11 *hostname* app[371964]: openssl.nim              deallocWrapper
Aug 04 19:59:11 *hostname* app[371964]: boehm.nim                deallocSharedImpl
Aug 04 19:59:11 *hostname* app[371964]: boehm.nim                deallocImpl
Aug 04 19:59:11 *hostname* app[371964]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Aug 04 19:59:13 *hostname* systemd[1]: app.service: Main process exited, code=dumped, status=11/SEGV

gdb backtrace of second crash

(gdb) bt
#0  0x00007f589449102a in GC_free () from /lib64/libgc.so.1
#1  0x00000000004060a6 in ?? ()
#2  0x000000000040613c in ?? ()
#3  0x000000000041756f in ?? ()
#4  0x00007f5892cdca74 in ERR_clear_error () from /lib64/libcrypto.so.1.1
#5  0x00007f58927258e0 in state_machine.part () from /lib64/libssl.so.1.1
#6  0x00007f58926fc5d0 in ssl3_write_bytes () from /lib64/libssl.so.1.1
#7  0x00007f589270f5d6 in ssl_write_internal () from /lib64/libssl.so.1.1
#8  0x00007f589270f707 in SSL_write () from /lib64/libssl.so.1.1
#9  0x00000000004353ff in ?? ()
#10 0x000000000043560a in ?? ()
#11 0x000000000042f319 in ?? ()
#12 0x000000000042fbf4 in ?? ()
#13 0x00000000004300b7 in ?? ()
#14 0x000000000047f4c3 in ?? ()
#15 0x00007f5894a37e9c in __run_exit_handlers () from /lib64/libc.so.6
#16 0x00007f5894a37fd0 in exit () from /lib64/libc.so.6
#17 0x0000000000405ea6 in ?? ()
#18 <signal handler called>
#19 0x00007f589449102a in GC_free () from /lib64/libgc.so.1
#20 0x00000000004060a6 in ?? ()
#21 0x000000000040613c in ?? ()
#22 0x000000000041756f in ?? ()
#23 0x00007f5892cdca74 in ERR_clear_error () from /lib64/libcrypto.so.1.1
#24 0x00007f589272a22d in tls_process_server_certificate () from /lib64/libssl.so.1.1
#25 0x00007f589272c965 in ossl_statem_client_process_message () from /lib64/libssl.so.1.1
#26 0x00007f5892726043 in state_machine.part () from /lib64/libssl.so.1.1
#27 0x00007f58926fc5d0 in ssl3_write_bytes () from /lib64/libssl.so.1.1
#28 0x00007f589270f5d6 in ssl_write_internal () from /lib64/libssl.so.1.1
#29 0x00007f589270f707 in SSL_write () from /lib64/libssl.so.1.1
#30 0x00000000004353ff in ?? ()
#31 0x000000000043560a in ?? ()
#32 0x000000000042f319 in ?? ()
#33 0x000000000042fbf4 in ?? ()
#34 0x00000000004300b7 in ?? ()
#35 0x0000000000432753 in ?? ()
#36 0x0000000000489be2 in ?? ()
#37 0x000000000048d75e in ?? ()
#38 0x000000000048da7b in ?? ()
#39 0x00000000004905bc in ?? ()
#40 0x0000000000490631 in ?? ()
#41 0x00007f5894a216a3 in __libc_start_main () from /lib64/libc.so.6
#42 0x000000000040248e in ?? ()

The latest nightly I'm referring to is:

Nim Compiler Version 1.3.5 [Linux: amd64]
Compiled at 2020-08-03
Copyright (c) 2006-2020 by Andreas Rumpf

git hash: d6161e7e9ea1f40ad734c6e811e8169fe24b5e9f
active boot switches: -d:release

@D-Nice
Copy link
Contributor

D-Nice commented Aug 5, 2020

As a note, similar crash still in boehm, with 1.2.6. I could swear I ran it fine with some earlier nim 1.2.x, I'll investigate more there.

markAndSweep won't compile at all with nightly for my app, getting
Error: system module needs: unsureAsgnRef
compiles fine with stable 1.2.4 & 1.2.6

Running it for now with 1.2.6 and M&S gc to see how it fares over time.

@alaviss
Copy link
Collaborator

alaviss commented Aug 7, 2020

This is interesting. Can you enable debug information (with --debugger:native) and use the default GC? It should give out coredumps with usable stacktrace then.

Also I'm seeing

error:140E0197:SSL routines:SSL_shutdown:shutdown while in init

in one of your logs. This is really odd as this bug should have been completely fixed in devel. Can you confirm your version with nim -v?

@D-Nice
Copy link
Contributor

D-Nice commented Aug 8, 2020

@alaviss

For that SSL error you've mentioned, I can confirm it was d6161e7e9ea1f40ad734c6e811e8169fe24b5e9f for the git hash of compiler used to compile that variant, albeit with boehm gc.

I'm looking to create a minimal reproducible example, and see if I can get same issues going there.

@D-Nice
Copy link
Contributor

D-Nice commented Aug 9, 2020

Compiled with recent devel: 2a8bf470af995f4e7d5bb8a0f49bb2226f95c311

Compile flags for the code: --gc:refc --stackTrace:on --debugger:native --skipProjCfg:on -d:chronicles_log_level:TRACE -d:danger --opt:size -d:ssl

backtrace with native debugger flag:

#0  0x00007f9e0cd1a3a3 in OPENSSL_sk_pop_free () from /lib64/libcrypto.so.1.1
#1  0x00007f9e0cd34e9d in X509_VERIFY_PARAM_free () from /lib64/libcrypto.so.1.1
#2  0x00007f9e0c49ed15 in SSL_free () from /lib64/libssl.so.1.1
#3  0x000055b9da4f1662 in close__g8nGs1aFQjSyKaCvnSGH1w (socket=0x7f9e0b9ee030) at /nim/lib/pure/asyncnet.nim:699                                                                                                
#4  0x000055b9da4fb8d3 in close__h9bWTiUUZLlQgLXHDtjyONA (client=0x7f9e0b819f28) at /nim/lib/pure/httpclient.nim:619                                                                                             
#5  0x000055b9da52feb7 in reqBtcAddressDetailsIter__ioevzQT4OCMuTzRWSwCM6g_2 (ClE_0=0x7f9e0bbb2518) at /tmp/project/src/crypto_address_watcher.nim:22                                                            
#6  0x000055b9da5304ac in reqBtcAddressDetailsNimAsyncContinue__Z9cQmJ9aPol6ErCmKfu4D3gw_2 (ClE_0=0x7f9e0b9d07c8) at /nim/lib/pure/asyncmacro.nim:34                                                             
#7  0x000055b9da4ec915 in processPendingCallbacks__xdICRvJ1ns9byYHPjM07uJQ (p=p@entry=0x7f9e0d7b1ae8, didSomeWork=didSomeWork@entry=0x7ffc0706f71f "\001\b\374\006\a\374\177")                                   
    at /nim/lib/pure/asyncdispatch.nim:210
#8  0x000055b9da4ecb8c in runOnce__nEnn7ilZ6aSkcXTtLEKnJA (timeout=timeout@entry=1000) at /nim/lib/pure/asyncdispatch.nim:1340                                                                                   
#9  0x000055b9da4ecf99 in poll__wcvh7vf5dGU1VYto9aiAnpA (timeout=1000) at /nim/lib/pure/asyncdispatch.nim:1576                                                                                                   
#10 0x000055b9da52ee6f in quitPoll__R2CWtKf4BdVZhGAdjU9b3cg_2 () at /tmp/project/src/crypto_address_watcher.nim:242                                                                                              
#11 0x00007f9e0d049e9c in __run_exit_handlers () from /lib64/libc.so.6
#12 0x00007f9e0d049fd0 in exit () from /lib64/libc.so.6
#13 0x000055b9da4c482b in signalHandler (sign=<optimized out>) at /nim/lib/system/excpt.nim:624
#14 <signal handler called>
#15 0x00007f9e0cd4d479 in OPENSSL_cleanse () from /lib64/libcrypto.so.1.1
#16 0x00007f9e0ccb9195 in CRYPTO_clear_free () from /lib64/libcrypto.so.1.1
#17 0x00007f9e0cbff221 in BUF_MEM_free () from /lib64/libcrypto.so.1.1
#18 0x00007f9e0cbe336a in mem_free () from /lib64/libcrypto.so.1.1
#19 0x00007f9e0cbdc49a in BIO_free () from /lib64/libcrypto.so.1.1
#20 0x00007f9e0cbdcfdc in BIO_free_all () from /lib64/libcrypto.so.1.1
#21 0x00007f9e0c49ed43 in SSL_free () from /lib64/libssl.so.1.1
#22 0x000055b9da4f1662 in close__g8nGs1aFQjSyKaCvnSGH1w (socket=0x7f9e0bb3e030) at /nim/lib/pure/asyncnet.nim:699                                                                                                
#23 0x000055b9da4fb8d3 in close__h9bWTiUUZLlQgLXHDtjyONA (client=0x7f9e0bbbc2e8) at /nim/lib/pure/httpclient.nim:619                                                                                             
#24 0x000055b9da52f5d3 in reqEthAddressDetailsIter__ioevzQT4OCMuTzRWSwCM6g (ClE_0=0x7f9e0bbb2258) at /tmp/project/src/crypto_address_watcher.nim:22                                                              
#25 0x000055b9da530262 in reqEthAddressDetailsNimAsyncContinue__Z9cQmJ9aPol6ErCmKfu4D3gw (ClE_0=0x7f9e0ba60f98) at /nim/lib/pure/asyncmacro.nim:34                                                               
#26 0x000055b9da4ec915 in processPendingCallbacks__xdICRvJ1ns9byYHPjM07uJQ (p=p@entry=0x7f9e0d7b1ae8, didSomeWork=didSomeWork@entry=0x7ffc07070f5f "\001H\024\a\a\374\177")                                      
    at /nim/lib/pure/asyncdispatch.nim:210
#27 0x000055b9da4ecb8c in runOnce__nEnn7ilZ6aSkcXTtLEKnJA (timeout=timeout@entry=500) at /nim/lib/pure/asyncdispatch.nim:1340                                                                                    
#28 0x000055b9da4ecf99 in poll__wcvh7vf5dGU1VYto9aiAnpA (timeout=timeout@entry=500) at /nim/lib/pure/asyncdispatch.nim:1576                                                                                      
#29 0x000055b9da4ed03f in waitFor__zz9aVsJN0aTuFUXvzKrFo9cA (fut=0x7f9e0bb01288) at /nim/lib/pure/asyncdispatch.nim:1886                                                                                         
#30 0x000055b9da538758 in caw__mMsoXFdgrPsk9aq0MM3yjQw (timeoutInSeconds=<optimized out>, ethAddressList=0x7f9e0d785d48, afterEthBlock=10622183, btcAddressList=0x7f9e0d785cc8, afterBtcBlock=620000,            
    alertSlack=1 '\001', alertAlexa=1 '\001', alertNotify=0 '\000', healthCheckUrl=0x7f9e0d786108, customCmd=0x0, retry=1 '\001', requestDelay=500) at /tmp/project/src/crypto_address_watcher.nim:189           
#31 0x000055b9da53bd09 in dispatchcaw__CzSmPHCYzo7ii1SVC5sbZQ (cmdline=0x7f9e0d7b7120, usage=<optimized out>, prefix=<optimized out>, parseOnly=<optimized out>, skipHelp=<optimized out>,                       
    noHdr=<optimized out>, ClE_0=0x7f9e0d7b7048) at /root/.nimble/pkgs/cligen-1.1.0/cligen.nim:703
#32 0x000055b9da53be4f in cligenDoNotCollideWithGlobalVar__pv2uo9c25a0IVl6NgMfudlA (cfX60gensym41140218_=0x55b9da596100 <clCfg__POT08z3OfOElyZB6YNbrng>) at /root/.nimble/pkgs/cligen-1.1.0/cligen.nim:741       
#33 0x000055b9da53c04a in NimMainModule () at /tmp/project/src/crypto_address_watcher.nim:251
#34 0x000055b9da53ee34 in NimMain () at /root/.nimble/pkgs/serialization-0.1.0/serialization/errors.nim:96                                                                                                       
#35 0x000055b9da4bc89e in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /root/.nimble/pkgs/serialization-0.1.0/serialization/errors.nim:103                                          

logs

Aug 08 21:54:14 *hostname* caw[423968]: TRC 2020-08-08 21:54:14-04:00 new async http client up                   tid=423968 client="ref 0x7f9e0bbbc208 --> [socket = nil,\nconnected = false,\ncurrentU"    
Aug 08 21:54:14 *hostname* caw[423968]: TRC 2020-08-08 21:54:14-04:00 new async http client up                   tid=423968 client="ref 0x7f9e0b819f28 --> [socket = nil,\nconnected = false,\ncurrentU"    
Aug 08 21:54:15 *hostname* caw[423968]: TRC 2020-08-08 21:54:15-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0bbbc2e8 --> [socket = ref 0x7f9e0bb3e030 --> [fd = 3,\n"     
Aug 08 21:54:15 *hostname* caw[423968]: Traceback (most recent call last)
Aug 08 21:54:15 *hostname* caw[423968]: crypto_address_watcher.nim crypto_address_watcher
Aug 08 21:54:15 *hostname* caw[423968]: cligen.nim               cligenDoNotCollideWithGlobalVar
Aug 08 21:54:15 *hostname* caw[423968]: cligen.nim               dispatchcaw
Aug 08 21:54:15 *hostname* caw[423968]: crypto_address_watcher.nim caw
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        waitFor
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        poll
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        runOnce
Aug 08 21:54:15 *hostname* caw[423968]: asyncdispatch.nim        processPendingCallbacks
Aug 08 21:54:15 *hostname* caw[423968]: asyncmacro.nim           reqEthAddressDetailsNimAsyncContinue                                                                                                       
Aug 08 21:54:15 *hostname* caw[423968]: asyncmacro.nim           reqEthAddressDetailsIter
Aug 08 21:54:15 *hostname* caw[423968]: httpclient.nim           close
Aug 08 21:54:15 *hostname* caw[423968]: asyncnet.nim             close
Aug 08 21:54:15 *hostname* caw[423968]: SIGSEGV: Illegal storage access. (Attempt to read from nil?)                                                                                                        
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0bbbc128 --> [socket = ref 0x7f9e0d71d030 --> [fd = 11,"      
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closed async http client                   tid=423968 client="ref 0x7f9e0bbbc128 --> [socket = ref 0x7f9e0d71d030 --> [fd = 11,"      
Aug 08 21:54:18 *hostname* caw[423968]: TRC 2020-08-08 21:54:18-04:00 closing async http client                  tid=423968 client="ref 0x7f9e0b819f28 --> [socket = ref 0x7f9e0b9ee030 --> [fd = 13,"      
Aug 08 21:54:19 *hostname* systemd[1]: caw.service: Main process exited, code=dumped, status=11/SEGV                                                                                                        
Aug 08 21:54:19 *hostname* systemd[1]: caw.service: Failed with result 'core-dump'.

@alaviss
Copy link
Collaborator

alaviss commented Aug 11, 2020

This seems to be caused by a double close. I've PR-ed a (hack?) fix in #15174. A proper fix will be done once I got some insight into some weird design within asyncnet.

mildred pushed a commit to mildred/Nim that referenced this issue Jan 11, 2021
…im-lang#15066)

* asyncnet, net: don't attempt SSL_shutdown if a fatal error occurred

Per TLS standard and SSL_shutdown(3ssl). This should prevent errors
coming from a close() after a bad event (ie. the other end of the pipe
is closed before shutdown can be negotiated).

Ref nim-lang#9867

* tssl: try sending until an error occur

* tssl: cleanup

* tssl: actually run the test

I forgot to make the test run :P

* tssl: run the test on ARC, maybe then it'll be happy

* tssl: turns off ARC, switch tlsEmulation on for freebsd

* tssl: document why tlsEmulation is employed

* net: move SafeDisconn handling logic to socketError
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment