4.1.2: Assertion failed: Connection reset by peer (zeromq\src\signaler.cpp:298) #1808
Which socket type are you using? |
There will be several SUB and DEALER sockets in the dying process. |
Can it be something like a firewall? Is it a virtual machine or a PC? |
It could be a firewall, I don't have access to the failing machine. I believe it is a real machine (cc @jveitchmichaelis for more details). |
It's a PC - local notebook. |
A few questions:
|
Sorry - I meant the Jupyter notebook is running as a local server. It's a desktop PC. The Windows firewall is on. The time varies - I'll try measuring it properly though. It happens both when idle and in the middle of running code (at least so it seems). My desktop is set not to sleep, so I don't think it's that. I'll try:
|
Can you try with the firewall off as well?
|
Yep will do, I'll see if turning on more verbose debugging in Jupyter throws up anything as well. |
Hi,
Reproduction:
We have tested this on three computers, two of which have this crash problem. |
So recently someone changed the signaler to use a random port; this might be related. Bottom line: in config.hpp, make sure the signaler_port is set to 5905. Also, in the firewall, make sure to allow TCP connections on port 5905. Let me know if this helps.
|
OK! Thanks for your suggestion; we will try it. We also found another problem: |
We need to make this configurable, since reusing the same port leads to conflicts.
|
Hi, |
Did you try same test with firewall disabled? |
Yes, I have run some tests with the firewall disabled, but it still crashed after 2 hours. We also tested "local_thr.exe" with the firewall disabled, and it crashed after 2 hours. (2016/2/26) |
@JimChenTaiwan are you using Windows 10 as well? |
Yes, I tested it on Windows 10 x64. |
I'm seeing what I think is identical behaviour here as well. Running the following:
After almost exactly two hours the following assert fires:
Similar behaviour occurs if compiled with poll enabled rather than select:
Running the same build of our software on Windows 7 doesn't demonstrate this issue. I'm using inproc connections. Additionally:
|
@mseagrief does it matter if the socket is idle or not? |
It seems like an issue with Windows 10. I tried to google it without much success; can you open an issue with Microsoft? |
@somdoron I'll do some tests today with the socket idle and report back. As to reporting to Microsoft, I'm not sure what I'd be reporting; I'm not sufficiently familiar with the ZeroMQ internals to make a useful bug report. That said, I'll spend some time in the next few days doing so and see if I can isolate the issue. |
ZeroMQ 3.2.5 doesn't appear to exhibit this behaviour. Just running zmq_init and then sleeping for 3 hours doesn't crash. However, 5 minutes after starting to use the sockets (using inproc_thr), the assert fires. Running zmq_init, creating and binding a PULL socket, and then sleeping asserts after 2 hours. |
It looks like the code added to address #1608 causes this behaviour to be triggered under Windows 10. Both the poll and select cases appear to work without triggering the assert with the code block in signaler.cpp commented out. As for what the underlying issue is I'm unsure. I tried that code as it was the most significant change I could see in signaler.cpp comparing 3.2.5 against the current git master. |
Having the same issue. An easy way to reproduce this problem is to have a large number of subscribers active and then simulate a congested/broken network using a tool called clumsy (I used ver 0.2). In the tool, enable "drop" and "out of order" and increase the percentages. Depending on your network load, it should crash eventually. |
#2334 is about connection reset by peer too. Take a look. |
I am seeing a lot of these |
Something is resetting or closing TCP connections on the box. It's not the first time I've heard of this, but it is incredibly toxic behavior. In some circumstances it can be an overzealous admin, or a firewall. I'd try to figure out why connections are randomly being spuriously disconnected.

That said, I still contend that designs that rely on this TCP loopback for signaling are stupid. With POSIX select- and epoll-driven loops you might not have a choice (though you usually have pipes, eventfds, or socketpair to work around this problem), but on Windows using select() or WSAPoll() to drive an event loop is stupid. NNG doesn't do this. I assume that ZeroMQ has a similar design to legacy nanomsg (mostly authored by the same person) and offers no other mechanism.

I doubt you can "fix" this other than by changing your design to use separate threads and blocking operations... that's a big change, not very scalable, but it would completely eliminate the need for those loopback connections. Or you can move away from ZMQ. (Have I mentioned NNG...?)

I probably won't comment further on this thread, since I'm not affiliated with the ZMQ project in any way. (Pieter tried to interest me in the project some months before his death, but I was too busy with nanomsg and really not interested in working on a C++/GPL-based project.) |
Agreed - something is killing the connection. But it's not any messaging stack; it's the Windows TCP stack, or maybe some unknown keep-alive logic running on this type of loopback connection. Unless nanomsg has built-in logic to re-establish the connection, I don't think it can handle this scenario either - the connection is gone. It's not about select, epoll, or IOCP. It's a Windows thing, and we only see it on Windows 10. We have separate data layers (handling the actual inbound & outbound queues). Right now, we are building "reconnect" logic with ASIO to work around this. |
(I know I said I probably wouldn't comment again... but I can't resist. Lol.) You're partly right. For TCP connections carrying user data, nanomsg, NNG, and ZMQ will all reconnect automatically.

The problem is that ZMQ and legacy nanomsg create a loopback connection using TCP, only on Windows, in order to provide a file descriptor that works with WSAPoll() or select(). These libraries use that connection to make a thread blocked in select or poll wake up (typically by writing a byte, which the library reads after waking up in poll/select). Under nanomsg, you know you're here if your program uses nn_getsockopt() to get the NN_RCVFD or NN_SNDFD. With ZeroMQ, it's the ZMQ_FD.

The thing is, on Windows, this is an unnatural act. Some programs do this because they were designed for UNIX systems and use poll() at the heart of an event loop. But Windows has much better support for something called I/O completion ports, and native asynchronous I/O mechanisms. So programs and runtimes designed for Windows from the beginning have no need for this hackery. Unfortunately, both ZMQ and legacy nanomsg were designed for POSIX first, and Windows was mostly an afterthought. So they lack any kind of asynchronous support in their APIs, instead settling on nonblocking send and receive operations combined with that notification file descriptor.

NNG (the next generation of nanomsg) was designed to be asynchronous from the start, and uses callbacks (along with condition variables and mutex locks for synchronization) instead of a poll-based solution. It offers the same descriptor-based approach, but whereas in nanomsg and ZMQ that's the only way to do asynchronous or non-blocking I/O, in NNG it's a second-class citizen, dropped in favor of more scalable, portable, performant, and natural operations.

If your program can use threads and synchronization primitives instead of a central event loop, you might not need the ZMQ_FD. That would be more natural on Windows... while threads aren't super scalable on Windows, at least then you could use blocking socket operations and ditch the need for the TCP loopback connection altogether. (Apart from any connection you might be using for your data -- but as I said, ZMQ, like all other similar systems, automatically reconnects those connections.) |
NNG sounds interesting! I will give it a try. Thanks @gdamore! But the reconnect logic you mentioned, provided by the messaging stacks - that part I am not clear about. If the connection is terminated by the peer, it won't be reconnected by ZMQ without application code logic. That is what we are discussing here, isn't it? Or are you talking about it from a TCP protocol point of view? |
I've seen people complain about Windows randomly killing TCP connections just because, in some cases it was a third party network management tool of some sort (nexus-something maybe?), check for that. Besides that, I think Microsoft added af_unix support in Windows 10 - that could be a good alternative to TCP loopback that won't suffer from this problem and should be relatively easy to implement, feel free to send a PR if you like |
I suspect that if there is AF_UNIX support, it won't work with WSAPoll and select; historically those have been exclusive to Winsock. That said, it's worth looking into.
|
So I am pretty sure for ZMQ (and certainly for nanomsg-family stacks) that even on a peer disconnect the stack will attempt to reconnect. This is in the dialer-side code and is done regardless of transport. This is a big piece of the self-healing strategy these systems provide. Of course, it's up to the client to do this; the server side cannot initiate a connection.

The thing is that this kind of logic is not provided for the signaling connection behind these notification file descriptors. Since both sides of that connection are in the same process and managed by the library, there was never any reason to think there would ever be a need. On UNIX systems using pipe() there is no way for another process or an administrator to interfere here. Windows is the loser here.

G.
|
When I said "it's a Windows thing", that wasn't accurate - it's actually something running on Windows 10 (anti-virus, packet filtering, monitoring) interacting with the Windows TCP/IP stack. The "connection reset by peer" in my case only happens on specific computers. Interestingly enough, when we run a traffic test on these target machines, the data speed on the recv side is much slower than on the send side. When we run the same test on machines that do not have the loopback connection drop issue, the send & recv data speeds differ far less. |
Yes there were similar reports in the past, unfortunately I can't remember the name of the software that was messing stuff up, nor find the issue I'm thinking about |
@bluca , Networx is what you're talking about. It's discussed earlier in this thread. |
The problem is that there is no way to predict what's been installed on the client machines. Better to have recovery logic in application code. |
Is the Jupyter notebook crash issue solved? |
OS: Windows 10. Same problem on Jupyter Notebook. I have reinstalled Anaconda and Jupyter, but:
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:201)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
[I 18:33:53.765 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 597559d6-5ccb-42f9-b4b0-803047dc72f6 restarted
WARNING:root:kernel 597559d6-5ccb-42f9-b4b0-803047dc72f6 restarted
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Same situation. About every 2 hours, the Jupyter notebook will crash. |
After:
it still shows:
|
hi guys, |
I got this problem after I installed the 'basemap' package to plot lat/longs. Before that, my Jupyter notebook was running perfectly fine. Has anyone got a fix for this one? |
Sorry guys, I found it was a bug in my code, caused by a wrong filter weight. |
The implementation is quite simple in this regard: on Windows there's a TCP socket bound to INADDR_LOOPBACK and a random port (0). If something on the system messes with those loopback connections, things go awry. |
Hello Garrett, I saw your posts here regarding the error 'Assertion failed: Connection reset by peer' occurring in zmq. It was suggested that 'some application' would trigger this behavior, such as a firewall or a network monitoring application. Thanks, Michel |
Same problem on Jupyter notebook, |
I recently implemented the |
Environment
Reproduction
...is difficult, but for at least one user, a Jupyter notebook left running can die with any of the following asserts:
The mailbox assert is longstanding and still open at #1108. Any ideas on what might be the cause of the connection resets or the size mismatch?