Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_immediate_3/test_reconnect_inproc do not terminate with POLL poller under Windows #3107

Closed
sigiesec opened this issue May 15, 2018 · 10 comments

Comments

@sigiesec
Copy link
Member

Please use this template for reporting suspected bugs or requests for help.

Issue description

test_immediate test case test_immediate_3 does not terminate with POLL poller under Windows

Environment

  • libzmq version (commit hash if unreleased): 4.2.5 and d81a041
  • OS: Windows
  • Poller: POLL

Minimal test code / Steps to reproduce the issue

  1. Run test_immediate

What's the actual result? (include assertion message & call stack if applicable)

Blocks forever.

What's the expected result?

Terminates without error.

@sigiesec
Copy link
Member Author

The main thread is in:

 	ntdll.dll!ZwWaitForSingleObject�()	Unknown
 	mswsock.dll!__GSHandlerCheck_SEH�()	Unknown
 	mswsock.dll!MSAFD_WSPPoll�()	Unknown
 	mswsock.dll!__GSHandlerCheck_SEH�()	Unknown
 	ws2_32.dll!WSAIoctl�()	Unknown
 	ws2_32.dll!WSAPoll�()	Unknown
>	libzmq-v141-mt-gd-4_3_1.dll!poll(pollfd * pfd=0x000000000023f468, unsigned long nfds=1, int timeout=-1) Line 90	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::signaler_t::wait(int timeout_=-1) Line 248	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_=0x000000000023f5c0, int timeout_=-1) Line 81	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::socket_base_t::process_commands(int timeout_=-1, bool throttle_=false) Line 1324	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::socket_base_t::send(zmq::msg_t * msg_=0x000000000023f778, int flags_=0) Line 1139	C++
 	libzmq-v141-mt-gd-4_3_1.dll!s_sendmsg(zmq::socket_base_t * s_=0x00000000002f4440, zmq_msg_t * msg_=0x000000000023f778, int flags_=0) Line 328	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq_send(void * s_=0x00000000002f4440, const void * buf_=0x000000013f5810a8, unsigned __int64 len_=5, int flags_=0) Line 360	C++
 	test_immediate.exe!send_string_expect_success(void * socket=0x00000000002f4440, const char * str=0x000000013f5810a8, int flags=0) Line 101	C++
 	test_immediate.exe!test_immediate_3() Line 214	C++
 	test_immediate.exe!UnityDefaultTestRun(void(*)() Func=0x000000013f57121c, const char * FuncName=0x000000013f581608, const int FuncLineNum=229) Line 1341	C
 	test_immediate.exe!main() Line 230	C++
 	test_immediate.exe!invoke_main() Line 79	C++
 	test_immediate.exe!__scrt_common_main_seh() Line 283	C++
 	test_immediate.exe!__scrt_common_main() Line 326	C++
 	test_immediate.exe!mainCRTStartup() Line 17	C++
 	kernel32.dll!BaseThreadInitThunk�()	Unknown
 	ntdll.dll!RtlUserThreadStart�()	Unknown

The I/O thread is in

>	ntdll.dll!ZwWaitForSingleObject�()	Unknown
 	mswsock.dll!__GSHandlerCheck_SEH�()	Unknown
 	mswsock.dll!MSAFD_WSPPoll�()	Unknown
 	mswsock.dll!__GSHandlerCheck_SEH�()	Unknown
 	ws2_32.dll!WSAIoctl�()	Unknown
 	ws2_32.dll!WSAPoll�()	Unknown
 	libzmq-v141-mt-gd-4_3_1.dll!poll(pollfd * pfd=0x00000000002ff900, unsigned long nfds=1, int timeout=-1) Line 90	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::poll_t::loop() Line 162	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::worker_poller_base_t::worker_routine(void * arg_=0x00000000002ef840) Line 136	C++
 	libzmq-v141-mt-gd-4_3_1.dll!thread_routine(void * arg_=0x00000000002ef880) Line 51	C++
 	ucrtbased.dll!invoke_thread_procedure(unsigned int(*)(void *) procedure=0x000007fec7fc7fb0, void * const context=0x00000000002ef880) Line 92	C++
 	ucrtbased.dll!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter=0x000000000030f070) Line 115	C++
 	kernel32.dll!BaseThreadInitThunk�()	Unknown
 	ntdll.dll!RtlUserThreadStart�()	Unknown

@sigiesec
Copy link
Member Author

Apparently this has nothing to do with ZMQ_IMMEDIATE. It also happens when removing the ZMQ_IMMEDIATE option, but trying to connect another ZMQ_DEALER peer after closing the first.

@sigiesec
Copy link
Member Author

This is demonstrated by https://github.com/sigiesec/libzmq/tree/fix-win-poll-deadlock (CI job https://ci.appveyor.com/project/sigiesec/libzmq/build/build-547/job/ha0n591d1w67dtlg)

I have no idea at the moment how this affected by the polling implementation.

@sigiesec
Copy link
Member Author

To avoid I am investigating non-sense, could someone please check if the test case in https://github.com/sigiesec/libzmq/blob/aa3ed7e9da24ff214919b6a58465ec26265ecd25/tests/test_spec_dealer.cpp#L202 is correct and is expected to work? It happens to work on all builds other than Windows poll, but this might be coincidence.

@sigiesec
Copy link
Member Author

sigiesec commented May 15, 2018

I instrumented libzmq to output all commands processed by a socket (and some debug output from the test case).

With POLLER=select, I get:

backend socket zmq_bind returned
socket 1: plug
socket 1: own
socket 2: plug
socket 2: plug
socket 2: own
socket 1: plug
socket 1: own
socket 1: attach
socket 1: bind
socket 2: attach
socket 2: own
socket 2: term_req
socket 2: term
socket 2: term_ack
socket -1: activate_read
backend socket zmq_close returned
socket -1: activate_read
socket -1: reap
socket -1: pipe_term
socket -1: activate_read
socket 1: term
socket -1: pipe_term_ack
socket 1: term
socket -1: pipe_term_ack
socket 1: term_ack
socket 1: term_ack
socket -1: reaped
socket 2: plug
socket 2: own
backend socket recreating
backend socket zmq_bind returned
socket 3: plug
socket 3: own
socket 2: attach # everything from here is missing below
socket 2: term_req
socket 2: term
socket 2: term_ack
socket 3: plug
socket 3: own
socket 3: attach
socket 3: bind
socket -1: activate_read
socket -1: activate_read
socket -1: reap
socket -1: reap
socket -1: pipe_term
socket 3: stop
socket -1: activate_read
socket 2: stop
socket 3: term
socket -1: pipe_term_ack
socket -1: pipe_term
socket -1: activate_read
socket 2: term
socket -1: pipe_term_ack
socket 3: term
socket -1: pipe_term_ack
socket -1: pipe_term_ack
socket 2: term_ack
socket 3: term_ack
socket -1: reaped
socket 3: term_ack
socket -1: stop
socket -1: reaped
socket -1: stop
d:\nls\libzmq\tests\test_spec_dealer.cpp:280:test_reconnect_inproc:PASS

With POLLER=poll, it blocks after:

backend socket zmq_bind returned
socket 1: plug
socket 2: plug
socket 1: own
socket 2: plug
socket 2: own
socket 1: plug
socket 1: own
socket 1: attach
socket 1: bind
socket 2: attach
socket 2: own
socket 2: term_req
socket 2: term
socket 2: term_ack
socket -1: activate_read
backend socket zmq_close returned
socket -1: activate_read
socket -1: reap
socket -1: pipe_term
socket -1: activate_read
socket 1: term
socket -1: pipe_term_ack
socket 1: term
socket -1: pipe_term_ack
socket 1: term_ack
socket 1: term_ack
socket -1: reaped
socket 2: plug
socket 2: own
backend socket recreating
socket 3: plug
backend socket zmq_bind returned
socket 3: own

@bluca
Copy link
Member

bluca commented May 15, 2018

Test case looks valid to me.

What happens if you call zmq_unbind before closing the socket? What if instead of binding, it connects?

@sigiesec
Copy link
Member Author

sigiesec commented May 15, 2018

@bluca Did not try this yet, but will later.

But I found another difference, which sounds relevant.

In the good run with POLLER=select, the following call stack is reached:

>	libzmq-v141-mt-gd-4_3_1.dll!zmq::tcp_connecter_t::out_event() Line 147	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::tcp_connecter_t::in_event() Line 132	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::select_t::trigger_events(const std::vector<zmq::select_t::fd_entry_t,std::allocator<zmq::select_t::fd_entry_t> > & fd_entries_={...}, const zmq::select_t::fds_set_t & local_fds_set_={...}, int event_count_=1) Line 154	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::select_t::select_family_entry(zmq::select_t::family_entry_t & family_entry_={...}, int max_fd_=0, bool use_timeout_=false, timeval & tv_={...}) Line 441	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::select_t::loop() Line 409	C++
 	libzmq-v141-mt-gd-4_3_1.dll!zmq::worker_poller_base_t::worker_routine(void * arg_=0x000000000031bda0) Line 136	C++
 	libzmq-v141-mt-gd-4_3_1.dll!thread_routine(void * arg_=0x000000000031bde0) Line 51	C++

tcp_connecter_t::out_event() Line 147 is never reached with POLLER=poll

It is this branch:

close ();

The pending connect had failed with error 10061 (WSAECONNREFUSED) before.

@sigiesec sigiesec changed the title test_immediate_3 does not terminate with POLL poller under Windows test_immediate_3/test_reconnect_inproc do not terminate with POLL poller under Windows May 17, 2018
@sigiesec
Copy link
Member Author

Ok, some more research showed that this is far more severe than I thought. If the information here is correct and I understand it correctly, this is impossible to fix under Windows, as WSAPoll simply does not report connection failures:

Can someone please double-check this?

While this seems to affect only a few existing test cases, it is non-foreseeable for a user when it might happen. It is probably only relevant for the I/O thread (i.e. poll.cpp), since a socket_poller/zmq_poller_poll would never poll for the result of a connect attempt.

The conclusion would need to be to change the build scripts to disallow the use of poll/WSApoll on Windows completely. select would then be the only poller_t implementation on Windows for now. If my assumption above is correct, we could still support poll for zmq_poll/zmq_poller_*/socket_poller_t.

@sigiesec
Copy link
Member Author

E.g. python rejected to support poll on Windows for similar reasons: https://bugs.python.org/issue16507

@bluca
Copy link
Member

bluca commented May 17, 2018

Yikes! I have no idea about WSAPoll but if that's the case, then I agree that we should remove the option for the I/O thread.

sigiesec added a commit to sigiesec/libzmq that referenced this issue May 22, 2018
Solution: added test case, reproduces the likely cause for zeromq#3107
sigiesec added a commit to sigiesec/libzmq that referenced this issue May 22, 2018
Solution: added test case, reproduces the likely cause for zeromq#3107
MohammadAlTurany pushed a commit to FairRootGroup/libzmq that referenced this issue Jan 28, 2019
Solution: added test case, reproduces the likely cause for zeromq#3107
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants