4.1.2: Assertion failed: Connection reset by peer (zeromq\src\signaler.cpp:298) #1808

Closed
minrk opened this issue Feb 17, 2016 · 84 comments · Fixed by #3751

Comments

@minrk
Member

minrk commented Feb 17, 2016

Environment

  • Windows 10 x64
  • zeromq installed via pyzmq bundle (libzmq-4.1.2)
  • Compiler: MSVC9 (Windows SDK 7.0 / VS2008)

Reproduction

...is difficult, but for at least one user, a Jupyter notebook left running can die with any of the following asserts:

Assertion failed: Connection reset by peer (src\signaler.cpp:181)
Assertion failed: Connection reset by peer (src\signaler.cpp:298)
Assertion failed: nbytes == sizeof (dummy) (src\signaler.cpp:303)
Assertion failed: ok (src\mailbox.cpp:94)

The mailbox assert is longstanding and still open at #1108. Any ideas on what might be the cause of the connection resets or the size mismatch?

@somdoron
Member

Which socket type are you using?

@minrk
Member Author

minrk commented Feb 17, 2016

There will be several SUB and DEALER sockets in the dying process.

@somdoron
Member

Could it be something like a firewall? Is it a virtual machine or a physical PC?

@minrk
Member Author

minrk commented Feb 22, 2016

It could be a firewall, I don't have access to the failing machine. I believe it is a real machine (cc @jveitchmichaelis for more details).

@jveitchmichaelis

It's a PC - local notebook.

@somdoron
Member

Few questions:

  • How long does it take for this to happen?
  • Is the application idle (no messages received for a long time, maybe hours)?
  • Is the firewall on?
  • Does the notebook come out of sleep?

@jveitchmichaelis

Sorry - I meant the Jupyter notebook is running as a local server. It's a desktop PC. The Windows firewall is on.

The time varies - I'll try measuring it properly though. It happens both when idle and in the middle of running code (at least so it seems). My desktop is set not to sleep, so I don't think it's that.

I'll try:

  1. Jupyter with no notebook open
  2. Idle notebook
  3. Notebook in an infinite loop. I don't think the particular piece of code matters, the crashes seem arbitrary.

@somdoron
Member

Can you try with the firewall off as well?


@jveitchmichaelis

Yep will do, I'll see if turning on more verbose debugging in Jupyter throws up anything as well.

@JimChenTaiwan

Hi,
We have the same situation.
Environment:

  • Windows 10 x64
  • both ZMQ 4.0.4 and 4.1.4 x86
  • Compiler: VS2013 x86 build

Reproduction:

  • Always crashes 2 hours after the server and client connect.
  • It happens both when idle (connected) and in the middle of running code.
  • Error output is one of:
  •   Assertion failed: Connection reset by peer (......\src\signaler.cpp:298)
  •   Assertion failed: Connection reset by peer (......\src\signaler.cpp:134)
  •   Assertion failed: ok (......\src\mailbox.cpp:82)

We have tested this on three computers, and two of them show this crash.
Can anyone help us?

@somdoron
Member

Someone recently changed the signaler to use a random port, which might
make the firewall work harder. I'm not sure whether that change is part of
4.1.4 or 4.0.4, but I suggest compiling with the following commit reverted:

7e09306

Bottom line: in config.hpp, make sure signaler_port is set to 5905
and not 0.

Also make sure the firewall allows TCP connections on port 5905 for the
application.

Let me know if this helps.


@JimChenTaiwan

OK! Thanks for your suggestion, we will try it.

We also found another problem.
When we ran only the "zmq:socket->connect" side of the zmq program,
it didn't crash after 2 hours (because it is not connected?),
but it crashed when we closed the "zmq:socket" and "zmq::context".
If we close the program before 2 hours have passed, it behaves correctly.
The error shown was:
  Assertion failed: Connection reset by peer (......\src\signaler.cpp:252)

@hintjens
Member

We need to make this configurable, since reusing the same port leads to
different problems. Hence that patch.

@JimChenTaiwan

[pic1]
[pic2]

Hi,
We ran only the built-in "local_thr.exe" as the test program and did not run the client program "remote_thr.exe".
After 2 hours, two TCP ports had changed state from "ESTABLISHED" to "FIN_WAIT1", as you can see in [pic1].
A while later, the "local_thr.exe" test program crashed, as [pic2] shows.
Why does it always happen after 2 hours?
Thanks a lot.

@somdoron
Member

Did you try the same test with the firewall disabled?

@JimChenTaiwan

Yes, I have run some tests with the firewall disabled, and it still crashed after 2 hours.
I haven't tested "local_thr.exe" with the firewall disabled yet; I'll try that next.

Update (2016/2/26): we have tested "local_thr.exe" with the firewall disabled, and it crashed after 2 hours.

@somdoron
Member

somdoron commented Mar 9, 2016

@JimChenTaiwan are you using windows 10 as well?

@JimChenTaiwan

Yes, I tested it on Windows 10 x64.

@ghost

ghost commented Aug 22, 2016

I'm seeing what I think is identical behaviour here as well.

Running the following:

  • Windows 10 x64
  • zeromq compiled with Visual Studio 2013 x64 using f9c8687

After almost exactly two hours the following assert fires:

Assertion failed: Connection reset by peer (..\..\..\..\src\signaler.cpp:351)

Similar behaviour occurs if compiled with poll enabled rather than select:

Assertion failed: pfd.revents & POLLIN (..\..\..\..\src\signaler.cpp:248)

Running the same build of our software on Windows 7 doesn't demonstrate this issue.

I'm using inproc connections and running inproc_thr (with a Sleep call added at the bottom of the for loop in worker to slow things down to hit 2 hours) reproduces the issue. Invoked as inproc_thr 64 1000000. After two hours the assert above fires.

Additionally:

  • Changing signaler_port to 5905 doesn't help.
  • Disabling Windows firewall doesn't appear to help

@somdoron
Member

@mseagrief does it matter if the socket is idle or not?

@somdoron
Member

It seems like an issue with Windows 10. I tried to google it without much success. Can you open an issue with Microsoft?

@ghost

ghost commented Aug 24, 2016

@somdoron I'll do some tests today with the socket idle and report back.

As to reporting to Microsoft, I'm not sure what I'd be reporting. I'm not sufficiently familiar with the ZeroMQ internals to make a useful bug report. That said, I'll spend some time in the next few days getting familiar with them and see if I can isolate the issue.

@ghost

ghost commented Aug 28, 2016

ZeroMQ 3.2.5 doesn't appear to exhibit this behaviour.

Just running zmq_init and then sleeping for 3 hours doesn't crash. However 5 minutes after starting to use the sockets (using inproc_thr) the assert fires.

Running zmq_init and then creating and binding a PULL socket and then sleeping asserts after 2hrs.
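
The observations above suggest a simple probe: open a connected pair, leave it idle, then push one byte through it. The following is a minimal sketch in Python and not libzmq code. `probe_after_idle` is a hypothetical helper, and `socket.socketpair()` is only a stand-in for the signaler pair. As far as I know, Python builds that pair from loopback TCP sockets on Windows and from Unix-domain sockets on POSIX, so the probe is only meaningful on Windows.

```python
import socket
import time


def probe_after_idle(idle_seconds):
    """Idle a connected socket pair, then push one byte through it.

    Returns None if the byte arrives, or the OSError that was raised
    (for example ConnectionResetError) if the pair broke while idle.
    """
    writer, reader = socket.socketpair()
    try:
        time.sleep(idle_seconds)      # leave the pair untouched
        writer.send(b"\x00")          # same idea as the signaler's dummy byte
        reader.settimeout(5.0)        # do not hang forever if nothing arrives
        data = reader.recv(1)
        if len(data) != 1:
            return ConnectionError("pair was closed while idle")
        return None
    except OSError as exc:
        return exc
    finally:
        writer.close()
        reader.close()
```

Calling `probe_after_idle(2 * 60 * 60 + 300)` on an affected machine would show whether a plain idle loopback pair breaks after roughly two hours, independently of ZeroMQ.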

@ghost

ghost commented Aug 28, 2016

It looks like the code added to address #1608 causes this behaviour to be triggered under Windows 10. Both the poll and select cases appear to work without triggering the assert with the code block in signaler.cpp commented out.

As for what the underlying issue is I'm unsure. I tried that code as it was the most significant change I could see in signaler.cpp comparing 3.2.5 against the current git master.

@tnthao

tnthao commented Feb 1, 2017

Having the same issue. An easy way to reproduce this problem is to have a large number of subscribers active and then simulate a congested or broken network using a tool called clumsy (I used version 0.2). In the tool, enable "drop" and "out of order" and increase the percentages. Depending on your network load, it should crash eventually.

@lytboris

lytboris commented Feb 7, 2017

#2334 is about connection reset by peer too. Take a look.

@SylvainCorlay
Contributor

I am seeing a lot of these Assertion failed: nbytes == sizeof (dummy) in signaler.cpp:364.

@gdamore

gdamore commented May 25, 2018

Something is resetting or closing TCP connections on the box. It's not the first time I've heard of this, but it is incredibly toxic behavior. In some circumstances it can be an overzealous admin, or firewall. I'd try to figure out why connections are being spuriously disconnected at random.

That said, I still contend that designs that rely on this TCP loopback for signaling are stupid. With POSIX select and epoll driven loops you might not have a choice (but you usually have pipes or eventfds or socketpair to work around this problem), but on Windows using select() or WSAPoll() to drive an event loop is stupid.

NNG doesn't do this.

I assume that ZeroMQ has a similar design to legacy nanomsg (mostly authored by the same person) and offers no other mechanism.

I doubt you can "fix" this other than by changing your design to use separate threads and blocking operations ... that's a big change, not very scalable, but it would completely eliminate the need for those loopback connections.

Or you can move away from ZMQ. (Have I mentioned NNG...?? )

I probably won't comment further on this thread, since I'm not affiliated with the ZMQ project in any way. (Peter tried to interest me in the project some months before his death, but I was too busy with nanomsg and really not interested in working on a C++/GPL based project.)

@mxcoppell

Agreed - something is killing the connection. But it's not any of the messaging stacks. It's the Windows TCP stack, or maybe some unknown keep-alive logic running on loopback connections. Unless nanomsg has built-in logic to re-establish the connection, I don't think it can handle this scenario either - the connection is gone. It's not about select, epoll or IOCP. It's a Windows thing and we only see it on Windows 10.

We have separate data layers (handling the actually inbound & outbound queues). Right now, we are building a "reconnect" logic with ASIO to work around this.

@gdamore

gdamore commented May 26, 2018

(I know I said I probably wouldn't comment again... but... I can't resist. Lol.)

You're partly right.

For TCP connections carrying user data, nanomsg, NNG, and ZMQ all will reconnect automatically.

The problem is that ZMQ and legacy nanomsg create a loopback connection using TCP, only on Windows, in order to provide a file descriptor that works with WSAPoll() or select(). These libraries use that connection to wake up the thread blocked in select or poll (typically by writing a byte, which the library then reads after waking up in poll/select).
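
The wake-up mechanism described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the libzmq or nanomsg implementation: the `Signaler` class and its method names are hypothetical, and `socket.socketpair()` stands in for whatever connected pair the library creates internally.

```python
import select
import socket


class Signaler:
    """Wake a thread blocked in select() by writing one byte to a socket pair."""

    def __init__(self):
        # Stand-in for the library's internal pair of connected sockets.
        self._reader, self._writer = socket.socketpair()

    def fileno(self):
        # Descriptor an application could hand to its own select()/poll() loop.
        return self._reader.fileno()

    def send(self):
        # Called by the thread that wants to wake the waiter.
        self._writer.send(b"\x00")

    def wait(self, timeout):
        # Blocks until the pair becomes readable or the timeout expires.
        readable, _, _ = select.select([self._reader], [], [], timeout)
        return bool(readable)

    def recv(self):
        # Consume the wake-up byte; a short read means the pair is gone.
        dummy = self._reader.recv(1)
        if len(dummy) != 1:
            raise ConnectionError("signaler pair was closed")

    def close(self):
        self._reader.close()
        self._writer.close()
```

If anything on the machine resets the connection behind such a pair, `send()` or `recv()` fails with an error that the caller has no natural way to recover from, which is consistent with the asserts reported in this thread.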

Under nanomsg, you know you're here if your program uses nn_getsockopt() to get the NN_RCVFD or NN_SNDFD. With ZeroMQ, it's the ZMQ_FD.

The thing is, on Windows, this is an unnatural act. Some programs do this because they were designed for UNIX systems, and use poll() at the heart of an event loop. But Windows has much better support for something called I/O completion ports, and native asynchronous I/O mechanisms. So programs and runtimes designed for Windows from the beginning have no need for this hackery.

Unfortunately, both ZMQ and legacy nanomsg were designed for POSIX first and Windows was mostly an afterthought. So they lack any kind of asynchronous support in their APIs, instead settling on nonblocking send and receive operations combined with that notification file descriptor.

NNG (the next generation of nanomsg) was designed to be asynchronous from the start, and uses callbacks (along with condition variables and mutex locks for synchronization) instead of poll-based solution. It offers the same descriptor based approach, but whereas in nanomsg and ZMQ that's the only way to do asynchronous or non-blocking I/O, in NNG it's a second class citizen dropped in favor of more scalable, portable, performant, and natural operations.

If your program can use threads and synchronization primitives instead of a central event loop, you might not need the ZMQ_FD. That would be more natural on Windows... while threads aren't super scalable on Windows, at least then you could use blocking socket operations and ditch the need for the TCP loopback connection altogether. (Apart from any connection you might be using for your data -- but as I said, ZMQ, like all other similar systems, automatically reconnects those connections.)
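
A rough sketch of the thread-based alternative described above, in Python and using only the standard library. `queue.Queue` stands in for a blocking socket receive; that is an assumption made for illustration and not a ZeroMQ or NNG API. The worker blocks in `get()` and is woken through the queue's internal lock and condition variable, so no notification file descriptor is involved.

```python
import queue
import threading


def run_worker(inbox, results):
    """Block on the inbox until a message arrives; None means shut down."""
    while True:
        item = inbox.get()            # blocking call, no select()/poll() loop
        if item is None:
            break
        results.append(item.upper())  # placeholder for real message handling


inbox = queue.Queue()
results = []
worker = threading.Thread(target=run_worker, args=(inbox, results))
worker.start()

for message in ("ping", "pong"):
    inbox.put(message)
inbox.put(None)                       # ask the worker to stop
worker.join()
print(results)                        # ['PING', 'PONG']
```

The trade-off is the one mentioned above: one blocked thread per source of events is simpler but scales worse than a single event loop.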

@mxcoppell

NNG sounds interesting! I will give it a try. Thanks @gdamore !

But the reconnect logic you mentioned as provided by the messaging stacks is the part I am not clear about. If the connection is terminated by the peer, ZMQ won't reconnect it without logic in the application code (not in ZMQ). This is what we are discussing here, isn't it? Or are you talking about it from the TCP protocol point of view?

@bluca
Member

bluca commented May 26, 2018

I've seen people complain about Windows randomly killing TCP connections just because, in some cases it was a third party network management tool of some sort (nexus-something maybe?), check for that.

Besides that, I think Microsoft added AF_UNIX support in Windows 10. That could be a good alternative to TCP loopback that won't suffer from this problem and should be relatively easy to implement. Feel free to send a PR if you like.

@gdamore

gdamore commented May 26, 2018 via email

@gdamore

gdamore commented May 26, 2018 via email

@mxcoppell

@bluca

When I said "It's a Windows thing", that was not accurate. It's actually something running on Windows 10 (anti-virus, packet filtering, monitoring) that is interacting with the Windows TCP/IP stack.

In my case, "Connection reset by peer" only happens on specific computers. Interestingly, when we run a traffic test on these machines, the data rate on the recv side is much lower than on the send side. When we run the same test on machines that do not have the loopback connection drop issue, the send and recv data rates differ far less.

@bluca
Member

bluca commented May 26, 2018

Yes there were similar reports in the past, unfortunately I can't remember the name of the software that was messing stuff up, nor find the issue I'm thinking about

@mmortal03

@bluca , Networx is what you're talking about. It's discussed earlier in this thread.

@mxcoppell

The problem is that there is no way to predict what has been installed on the client machines. It is better to have recovery logic in the application code.
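
One possible shape for such recovery logic, as a hedged sketch in Python. It is not the ASIO-based code mentioned earlier in the thread and not part of ZeroMQ: `RecoveringChannel` and its `factory` argument are hypothetical names, and the factory is assumed to return a fresh `(writer, reader)` pair of connected sockets.

```python
import socket


class RecoveringChannel:
    """Rebuild a connected socket pair when the connection is lost."""

    def __init__(self, factory, max_retries=3):
        self._factory = factory            # callable returning (writer, reader)
        self._max_retries = max_retries
        self._writer, self._reader = factory()
        self.reconnects = 0

    def _rebuild(self):
        # Drop the broken pair and ask the factory for a new one.
        self.close()
        self._writer, self._reader = self._factory()
        self.reconnects += 1

    def _read_exact(self, count):
        buf = b""
        while len(buf) < count:
            chunk = self._reader.recv(count - len(buf))
            if not chunk:
                raise ConnectionResetError("peer closed the connection")
            buf += chunk
        return buf

    def roundtrip(self, payload):
        """Send payload and read it back, retrying on connection errors."""
        for attempt in range(self._max_retries + 1):
            try:
                self._writer.sendall(payload)
                return self._read_exact(len(payload))
            except ConnectionError:
                if attempt == self._max_retries:
                    raise
                self._rebuild()

    def close(self):
        self._writer.close()
        self._reader.close()
```

Note that a retry resends the whole payload, so a real application would also need to decide how to handle messages that were partly delivered before the reset.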

@mrlonely001

Has the Jupyter notebook crash issue been solved?

@liuqdev

liuqdev commented Dec 2, 2018

OS: Windows 10
Jupyter Notebook: 5.6.0
Python: 3.7

Same problem on Jupyter Notebook. I have reinstalled anaconda and jupyter, but:

Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:201)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)
[I 18:33:53.765 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 597559d6-5ccb-42f9-b4b0-803047dc72f6 restarted
WARNING:root:kernel 597559d6-5ccb-42f9-b4b0-803047dc72f6 restarted
Assertion failed: Connection reset by peer [10054] (bundled\zeromq\src\signaler.cpp:379)

Same situation. About every 2 hours, Jupyter Notebook will crash.

@liuqdev

liuqdev commented Dec 18, 2018

After:

pip uninstall networkx

it still shows:

Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:181)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)
Assertion failed: Connection reset by peer (bundled\zeromq\src\signaler.cpp:298)

@karl-cenliming

Hi guys,
Is there any update on this issue?

@roushanprasad

I got this problem after I installed the 'basemap' package to plot lat/longs. Before that, my Jupyter notebook was running perfectly fine. Has anyone got a fix for this one?

@GF-Huang

GF-Huang commented Sep 5, 2019

Hey guys,

I searched Google for "WFP zeromq" and found this page. I think I should be able to provide some useful information.

I'm developing a network traffic redirection driver based on WFP.

During testing, I ran various software to verify that my driver works correctly, including OpenShot Video Editor. (Its GitHub page)

According to my driver design, local loopback traffic should not be redirected; it should bypass the WFP filter. All other software works as expected.

Only OpenShot Video Editor does not work as expected: a loopback TCP connection is captured by my driver when OpenShot starts.

So I looked at the OpenShot source code to see why it doesn't work as expected. Finally, I found the code that creates the loopback connection. It uses zeromq. So I searched Google and found this page...

Unfortunately, I don't understand the implementation of zeromq. At present, nobody seems to have figured out what causes the problem.

Sorry guys... I found a bug in my code, caused by a wrong filter weight.

@bluca
Member

bluca commented Sep 5, 2019

The implementation is quite simple in this regard: on Windows there's a TCP socket bound to INADDR_LOOPBACK and a random port (0). If something on the system messes with that loopback connection, things go awry.
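
For readers unfamiliar with the pattern, here is a minimal Python sketch of building such a pair by hand: listen on the loopback address with port 0, connect to whatever port the OS assigned, and accept. `make_loopback_pair` is a hypothetical name, and the details (for example setting TCP_NODELAY) are assumptions for illustration and not a transcription of libzmq's signaler code.

```python
import socket


def make_loopback_pair():
    """Return (writer, reader): two connected TCP sockets on 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS to pick any free port.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]

        writer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Assumption: disable Nagle so a one-byte wake-up is sent at once.
            writer.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            writer.connect(("127.0.0.1", port))
            reader, _ = listener.accept()
        except OSError:
            writer.close()
            raise
    finally:
        # The listener is only needed while the pair is being set up.
        listener.close()
    return writer, reader
```

Because the result is an ordinary TCP connection, it is visible to firewalls, packet filters and monitoring tools like any other connection on the machine.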

@bluca
Member

bluca commented Sep 5, 2019

@maajdl

maajdl commented Sep 11, 2019

Hello Garrett,

I saw your posts here regarding the error ‘Assertion failed: Connection reset by peer’ occurring in zmq.
I have asked in many places for a solution or workaround for this issue, but I got no answer.
See for example my post on jupyter.org (I get this issue as a full-time Jupyter user on Windows 8.1).

It was suggested to me that ‘some application’ might trigger this behavior, like the firewall or a network monitoring application.
Could you give me a suggestion about how I could identify the responsible application?
I don’t know how to monitor such things.

Thanks,

Michel

@zhangx258

Same problem on Jupyter notebook.
Are there any updates on this issue?

@sigiesec
Member

sigiesec commented Dec 5, 2019

I recently implemented the ipc transport under Windows 10 using Unix domain sockets. It should be possible to also use Unix domain sockets on Windows 10 for the signaler socket. Since we don't know what causes the connection resets for the current TCP/IP signaler sockets, we cannot be sure if this solves the problem, but there is a chance. If someone else wants to work on this, have a look at the changes made by #3717. zmq::fd_pair must be adapted to use Unix domain sockets if available (which can be checked either at configuration time or at run-time). There is no socketpair on Windows 10 though, so the pair of sockets must be created manually.
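
To make "created manually" concrete, here is a hedged Python sketch of building a connected Unix domain socket pair without socketpair: bind a listener to a temporary path, connect, accept, then discard the listener. `make_unix_pair` is a hypothetical helper and not the zmq::fd_pair code. It assumes a Python build that exposes `socket.AF_UNIX`, which as far as I know is the case on Linux and macOS but not in standard CPython builds on Windows.

```python
import os
import socket
import tempfile


def make_unix_pair():
    """Return (writer, reader): two connected Unix domain stream sockets."""
    if not hasattr(socket, "AF_UNIX"):
        raise OSError("this Python build does not expose AF_UNIX")

    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "signaler.sock")
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        listener.bind(path)           # creates the socket file at `path`
        listener.listen(1)
        writer = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            writer.connect(path)
            reader, _ = listener.accept()
        except OSError:
            writer.close()
            raise
    finally:
        # The path is only needed while connecting; clean it up afterwards.
        listener.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)
    return writer, reader
```

The resulting pair never touches the TCP/IP stack, which is why it might sidestep whatever is resetting the loopback TCP connections, although as noted above that is not certain.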
