Socket monitor hangs if zmq_bind or zmq_setsockopt fails in ::monitor #1315

Closed
artemale opened this issue Jan 20, 2015 · 9 comments · Fixed by zeromq/zeromq4-1#74

Comments

@artemale

Hello,

Version: 4.0.4
RHEL: 6.1

Scenario:

  1. A second monitor is created for the same connection (invalid use caused by a bug in user code, since fixed).
  2. The monitor socket bind fails and rc = -1.
  3. stop_monitor is called.
  4. A subscription for ZMQ_EVENT_MONITOR_STOPPED exists.
  5. User code hangs in zmq::signaler_t::wait (this=..., timeout_=-1) at signaler.cpp:173.

Analysis:

  1. The incorrect usage of ZeroMQ has been fixed so that only a single monitor is created. That is not the point here.
  2. int zmq::socket_base_t::monitor (const char *addr_, int events_) calls stop_monitor if either 1) zmq_setsockopt fails or 2) zmq_bind fails.
  3. I think it is wrong to call stop_monitor when the socket is not bound: if a ZMQ_EVENT_MONITOR_STOPPED subscription exists, it calls zmq_sendmsg on a monitor_socket that is not bound to an address or whose socket option could not be set.

At least I could not find another explanation for why the user thread hangs. I apologize if this idea is wrong.

Stack:
Thread 1 (Thread 0x7f80d2e23740 (LWP 15523)):
#0 0x00007f80c82e5053 in poll () from /lib64/libc.so.6
#1 0x00007f80cb341056 in zmq::signaler_t::wait (this=0x1f6816360, timeout_=-1) at signaler.cpp:173
#2 0x00007f80cb33217e in zmq::mailbox_t::recv (this=0x1f6816300, cmd_=0x7fff786b5e20, timeout_=-1) at mailbox.cpp:72
#3 0x00007f80cb341954 in zmq::socket_base_t::process_commands (this=0x1f6816000, timeout_=, throttle_=false) at socket_base.cpp:872
#4 0x00007f80cb341d80 in zmq::socket_base_t::send (this=0x1f6816000, msg_=0x7fff786b5ed0, flags_=) at socket_base.cpp:724
#5 0x00007f80cb35575a in s_sendmsg (s_=0x1f6816000, msg_=0x7fff786b5ed0, flags_=2) at zmq.cpp:350
#6 0x00007f80cb3412d8 in zmq::socket_base_t::monitor_event (this=0x8013700, event_=, addr_="") at socket_base.cpp:1249
#7 0x00007f80cb34366a in zmq::socket_base_t::stop_monitor (this=0x8013700) at socket_base.cpp:1265
#8 0x00007f80cb343886 in zmq::socket_base_t::monitor (this=0x8013700, addr_=0x1f537b0a8 "inproc://monitor.sock.KROMBERG_RTA2ZMQ_REQ0", events_=2047) at socket_base.cpp:1133

Aleksei.

@oliora

oliora commented Jan 26, 2015

Could be related to #1279

@artemale
Author

artemale commented Jul 6, 2015

It happened again... The scenario is disconnecting and reconnecting REQ-REP. The monitor was stopped by calling the socket monitor with a NULL address:
zmq_socket_monitor (m_sock, NULL, ZMQ_EVENT_ALL);

After that, we made a valid zmq_socket_monitor call, and again the bind failed and the call hung.

@artemale
Author

artemale commented Jul 6, 2015

Looking for a workaround by removing the subscription for the monitor stopped event...

@artemale
Author

artemale commented Jul 6, 2015

  1. I removed the subscription to MONITOR_STOPPED, so no monitor-stopped event will be issued.
  2. I need to understand why zmq_bind fails on a fast disconnect-connect. I suspect the monitor's socket takes some time to be released at the TCP level. I added a 500 ms sleep before the bind.
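
For anyone wanting to try the first workaround, here is a minimal sketch of building an event mask that leaves out the monitor-stopped event. The helper name monitor_mask_without_stopped is hypothetical. The fallback #defines only exist so the snippet compiles without the library header; they use the 4.0.x values (ZMQ_EVENT_ALL shows up as events_=2047 in the stack trace above), and real code should include <zmq.h> and rely on its definitions instead.

```c
/* In real code: #include <zmq.h> and drop the fallback block below. */
#ifndef ZMQ_EVENT_ALL
/* Fallback values as found in libzmq 4.0.x (assumption for other versions). */
#define ZMQ_EVENT_MONITOR_STOPPED 1024
#define ZMQ_EVENT_ALL 2047
#endif

/* Hypothetical helper: every monitor event except MONITOR_STOPPED. */
static int monitor_mask_without_stopped (void)
{
    return ZMQ_EVENT_ALL & ~ZMQ_EVENT_MONITOR_STOPPED;
}
```

The result would be passed as the third argument, e.g. zmq_socket_monitor (sock, "inproc://some-monitor", monitor_mask_without_stopped ()), where sock and the endpoint name are placeholders. This only avoids the blocking send in the failure path; it does not make the bind itself succeed.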

I am talking to myself :)

@oliora

oliora commented Jul 6, 2015

What error did you get from zmq_bind? Something like "address in use"?

@artemale
Author

artemale commented Jul 6, 2015

It is not possible to check without modifying the internals, but I think so:
rc = zmq_bind (monitor_socket, addr_);
if (rc == -1)
    stop_monitor ();

rc == -1, so we go on to stop_monitor.

@oliora

oliora commented Jul 6, 2015

The reason could be the following. ZeroMQ closes connections asynchronously in a separate thread, so closing a connection and immediately binding to the same address can fail because the address is still in use. Try adding a small sleep after closing the monitor; it will probably help.

@artemale
Author

artemale commented Jul 6, 2015

Thanks, I've done this already and will write a unit test tomorrow. Still, once the program is trapped in wait(), it cannot continue. I am also wondering why the send timeout was -1 in my case when the monitor event was being sent and the call hung. I checked our application: it sets the send timeout to 10 seconds via setsockopt.

I hope to work around this issue for myself by adding a sleep...

@wcs1only
Contributor

wcs1only commented Oct 9, 2015

So I ran into this issue and did some digging. Turns out you can reliably reproduce this issue simply by reusing the same monitor address twice.

#include <assert.h>
#include <zmq.h>

int main (void)
{
     int rc;
     void *ctx = zmq_ctx_new ();
     assert (ctx);
     void *client = zmq_socket (ctx, ZMQ_DEALER);
     assert (client);
     void *server = zmq_socket (ctx, ZMQ_DEALER);
     assert (server);

     // Monitor all events on the client socket
     rc = zmq_socket_monitor (client, "inproc://monitor-client", ZMQ_EVENT_ALL);
     assert (rc == 0);

     // Reusing the same monitor address should fail, but instead it hangs
     rc = zmq_socket_monitor (server, "inproc://monitor-client", ZMQ_EVENT_ALL);
     assert (rc == -1);

     // Create a socket for collecting the client's monitor events
     void *client_mon = zmq_socket (ctx, ZMQ_PAIR);
     assert (client_mon);

     // Connect it to the inproc endpoint so it will get events
     rc = zmq_connect (client_mon, "inproc://monitor-client");
     assert (rc == 0);

     // Bind the server and connect the client over TCP
     rc = zmq_bind (server, "tcp://127.0.0.1:9998");
     assert (rc == 0);
     rc = zmq_connect (client, "tcp://127.0.0.1:9998");
     assert (rc == 0);

     // Close client and server
     zmq_close (client);
     zmq_close (server);

     // Close the monitor collector and terminate the context
     zmq_close (client_mon);
     zmq_ctx_term (ctx);

     return 0;
}

The actual hang is caused by stop_monitor trying to send an event on a socket for which bind has failed. An easy solution would be to simply not send an event when we fail to create or bind the monitor socket. I have a patch; I'll follow up with a pull request.
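
To illustrate the idea, here is a toy model of that control flow in plain C. It is not libzmq code: monitor_model_t, model_monitor, model_stop_monitor and the send_event flag are all made up for this sketch, and a counter stands in for the send that blocks in the real library. It only shows the shape of the proposed behavior, where the setup-failure path tears the monitor down without emitting the stopped event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket's monitor state (not a libzmq type). */
typedef struct {
    bool have_socket;  /* monitor socket has been created */
    bool bound;        /* bind on the monitor socket succeeded */
    int events;        /* subscribed event mask */
    int sends;         /* number of simulated "monitor stopped" sends */
} monitor_model_t;

enum { EV_MONITOR_STOPPED = 1024 };  /* mirrors the 4.0.x flag value */

/* Tear the monitor down; emit the stopped event only if send_event is set. */
static void model_stop_monitor (monitor_model_t *m, bool send_event)
{
    assert (m != NULL);
    if (!m->have_socket)
        return;
    if (send_event && (m->events & EV_MONITOR_STOPPED))
        m->sends++;  /* stands in for the send that hung on an unbound socket */
    m->have_socket = false;
    m->bound = false;
    m->events = 0;
}

/* Start monitoring; bind_ok simulates the result of binding the endpoint. */
static int model_monitor (monitor_model_t *m, int events, bool bind_ok)
{
    assert (m != NULL);
    m->events = events;
    m->have_socket = true;
    if (!bind_ok) {
        /* Setup failed: clean up silently instead of sending an event. */
        model_stop_monitor (m, false);
        return -1;
    }
    m->bound = true;
    return 0;
}
```

In this model, a failed setup returns -1 with no send attempted, while an explicit stop on a successfully bound monitor still reports the stopped event to subscribers.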


3 participants