Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failed in signaler again #3756

Closed
vojtech-frodl opened this issue Dec 9, 2019 · 7 comments · Fixed by #3757
Closed

Assertion failed in signaler again #3756

vojtech-frodl opened this issue Dec 9, 2019 · 7 comments · Fixed by #3757

Comments

@vojtech-frodl
Copy link

vojtech-frodl commented Dec 9, 2019

Issue description

I encountered the same issue as described in #2360 which was solved some time ago by commit bcf7577. This solution was reverted by commit 4f77cfa with a commit message "Removed unreachable code paths". Out of curiosity, I added the while loop again and observed that the issue disappeared. Can you please confirm that commit 4f77cfa does what was intended and doesn't cause regression?

Environment

  • libzmq version: v4.3.2
  • OS: Windows 10 64-bit

Minimal test code / Steps to reproduce the issue

The assertion fails in proxy() with sockets A (router) and B (dealer). Socket B connects to a remote location. Socket A is bound to an inproc address and multiple threads connect to it and disconnect from it dynamically. All sockets are created in the thread which uses them.
I noticed that you increase chances of the crash significantly if you first run a loop in a separate thread in which you create a new req socket, connect it to A, destroy the req socket and repeat. Let the thread run for a few seconds and then request some data from the remote location by some other req socket connected to A and the crash is virtually guaranteed.

What's the actual result?

::send() returns SOCKET_ERROR causing the following assertion to fail: zmq_assert (nbytes == sizeof (dummy));
Callstack:

[...]
RaiseException()	Unknown
zmq::zmq_abort(const char * errmsg_) Line 90	C++
zmq::signaler_t::send() Line 224	C++
zmq::mailbox_t::send(const zmq::command_t & cmd_={...}) Line 67	C++
zmq::object_t::send_pipe_term_ack(zmq::pipe_t * destination_) Line 344	C++
zmq::pipe_t::process_delimiter() Line 499	C++
zmq::pipe_t::read(zmq::msg_t * msg_=0x00000214049f9980) Line 218	C++
zmq::fq_t::recvpipe(zmq::msg_t * msg_=0x00000214049f9980, zmq::pipe_t * * pipe_=0x0000004d41aff310) Line 98	C++
zmq::router_t::xhas_in() Line 383	C++
zmq::socket_base_t::getsockopt(int option_, void * optval_=0x0000004d41aff3b0, unsigned __int64 * optvallen_=0x0000004d41aff3d0) Line 457	C++
zmq_poll(zmq_pollitem_t * items_=0x0000004d41aff890, int nitems_, long timeout_=-1) Line 1033	C++
zmq::proxy(zmq::socket_base_t * frontend_=0x00000214049f93e0, zmq::socket_base_t * backend_=0x0000021379010d20, zmq::socket_base_t * capture_=0x0000000000000000, zmq::socket_base_t * control_=0x0000000000000000) Line 575	C++
zmq::proxy(zmq::socket_ref frontend={...}, zmq::socket_ref backend={...}, zmq::socket_ref capture={...}) Line 1541	C++
[...]
@sigiesec
Copy link
Member

Thanks for reporting this.

This is a bit more difficile though. See #2360 and #2362.

Probably the loop should be put back in without the immediate assertion wsa_assert (nbytes != SOCKET_ERROR); but allowing for some retries?

However, this doesn't really match your report. You are saying that send returned SOCKET_ERROR, but in that case the first assertion wsa_assert (nbytes != SOCKET_ERROR); should have failed, but you report the second assertion zmq_assert (nbytes == sizeof (dummy)); failed. Can you re-check that? Can you tell me what values nbytes and sizeof (dummy) have when the crash occurs?

@vojtech-frodl
Copy link
Author

I did some debugging and the following came up:

  1. send() returns -1 (SOCKET_ERROR)
  2. wsa_assert() calls wsa_error() which returns NULL because WSAGetLastError() returns WSAEWOULDBLOCK (see the switch case in wsa_error_no()). wsa_assert() then doesn't abort because errstr is NULL.
  3. zmq_assert (nbytes == sizeof (dummy)); fails because -1 != 1.

@sigiesec
Copy link
Member

Ah, sorry I missed that wsa_assert does not abort on WSAEWOULDBLOCK. Now it makes sense to me. I will take care of that.

@sigiesec
Copy link
Member

@vojtech-frodl I hope the new code fixes the regression.

I just wonder if you have a test case that reproduces this near-deterministically. Interestingly, I have never seen this happen in the existing tests. While the change may fix the assertion, it might be that there is some underlying performance issue in some use case, which causes the socket to behave like this. It would be great to be able to investigate this further.

@vojtech-frodl
Copy link
Author

Yes, I can confirm that the regression is fixed now.

I have quite a solid repro routine but it's far from a minimal reproducer. I can try to minimize it at the weekend.

@vojtech-frodl
Copy link
Author

vojtech-frodl commented Dec 15, 2019

Here's a reproducer.
main.zip
It's not as reliable as my original routine but if you're lucky, it will crash in about an hour. You can have everything built in Debug configuration.
Here's my console output.

client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
server worker reconnects
Assertion failed: nbytes == sizeof (dummy) (C:\libs\zeromq-4.3.2\src\signaler.cpp:190)
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects
client worker reconnects

@barometz
Copy link

barometz commented May 6, 2020

I'm very late to this, but:

Before I discovered that this is already fixed in master, I spent a few hours reproducing this issue in 4.3.2. Maybe my work can do some good, if you're still looking for a more reliable reproduction. I have a test case that trips this assertion within 20 seconds or so with 4.3.2 on my system, see https://github.com/barometz/zmq-parallel-nbytes. Built with MSVC 14.2 (VS2019), x64 target.

May have to twiddle the tasks/cycles a bit.

Update: if necessary, I have a sequence of less horribly parallel tests that fails frequently on my own computer and consistently in CI - I can probably turn that into a better test case for this issue, but it'll take some time. Let me know if that's necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants