
Hwm tests #3242

Merged
merged 23 commits into from
Sep 13, 2018

Conversation


@f18m f18m commented Sep 4, 2018

Problem: correct HWM handling is checked only on INPROC transport
Problem: the ZMQ proxy control socket is unusable when the proxy reaches its XPUB HWM and lossy flag is set.

Solution:

  • extend the existing HWM unit test to also cover the TCP and IPC transports
  • add a new unit test that shows that the ZMQ proxy control socket becomes unusable when the proxy reaches its XPUB HWM with the lossy flag set, and how to cope with that

Note: this PR is still work-in-progress. In particular, I need to think about how to organize the new unit test on the proxy; currently it requires human inspection to understand what's going on.

// send 1000 msg on hwm 1000, receive 1000
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000));
// send 1000 msg on hwm 1000, receive 1000, on TCP transport
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, ENDPOINT_0));
Member (inline review comment):

This will cause tests that run in parallel to fail; please use "tcp://127.0.0.1:*" and change the code to use the _ENDPOINT socket option


bluca commented Sep 4, 2018

Please also add the new test to tests/CMakeLists.txt

@bluca bluca changed the title Hwm tests [WIP] Hwm tests Sep 4, 2018

f18m commented Sep 8, 2018

Hi @bluca,
I modified "test_hwm_pubsub" to use a random port for TCP.
I added "test_proxy_hwm" to CMake and also cleaned it up. I added comments at the beginning of that source file to explain its purpose.

I'll keep an eye on the automatic AppVeyor and Travis CI checks!

Thanks


TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "tcp://127.0.0.1:*"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "inproc://a"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester"));
Member (inline review comment):

IPC does not work on Windows, so this will fail; this line and the other one need to be wrapped in #ifndef ZMQ_HAVE_WINDOWS
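The requested guard can be sketched like this (ZMQ_HAVE_WINDOWS is the real libzmq build define; TEST_IPC_SUPPORTED and run_transport_tests are illustrative names for this sketch only):

```c
/* libzmq defines ZMQ_HAVE_WINDOWS on Windows builds, where the ipc://
   transport is unavailable, so the IPC test variant is compiled out. */
#ifdef ZMQ_HAVE_WINDOWS
#define TEST_IPC_SUPPORTED 0
#else
#define TEST_IPC_SUPPORTED 1
#endif

/* Counts how many transport variants this build would exercise. */
static int run_transport_tests (void)
{
    int tests_run = 0;
    tests_run++; /* tcp variant: runs everywhere */
    tests_run++; /* inproc variant: runs everywhere */
#if TEST_IPC_SUPPORTED
    tests_run++; /* ipc variant: skipped on Windows */
#endif
    return tests_run;
}
```

On non-Windows platforms all three variants run; on Windows the IPC one is simply not compiled.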


f18m commented Sep 8, 2018

Hi @bluca,
I disabled the IPC test on Windows and also tried to address some shortcomings in the docs.

@@ -854,11 +858,13 @@ If this limit has been reached the socket shall enter an exceptional state and
depending on the socket type, 0MQ shall take appropriate action such as
blocking or dropping sent messages. Refer to the individual socket descriptions
in linkzmq:zmq_socket[3] for details on the exact action taken for each socket
type.
type.
Member (inline review comment):

please remove this extra whitespace


bluca commented Sep 8, 2018

The HWM tests are failing on Travis, could you please check? Thanks


f18m commented Sep 9, 2018

I checked, and some of the failures are related to the TCP transport test cases failing due to timing issues... I moved a check to after a wait of SETTLE_TIME ms to improve test reliability.

Other builds are failing due to clang-format issues about source code formatting... do you have any hint on how to fix these? Thanks!


bluca commented Sep 9, 2018

There's only one clang-format job, and it prints the diff that needs to be applied; you can also run make format-check

But there's also the IPC test failing to bind on OSX


void test_ipc ()
{
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester1"));
Member (inline review comment):

The issue on OSX is that this is trying to use the abstract namespace, which, as the zmq_ipc manpage says, is Linux-only. You can use /tmp, or a wildcard ipc://* together with the _endpoint call.

Contributor Author (f18m) replied:

ok, should be fixed now


f18m commented Sep 11, 2018

@bluca : is there some special macro to emit debugging prints from a unit test source file?
I was using a custom macro taking a variadic argument list in test_proxy_hwm.cpp, but I see some Travis builds fail with:

../tests/test_proxy_hwm.cpp:63:23: error: anonymous variadic macros were introduced in C99 [-Werror=variadic-macros]

So shall I use printf() directly? Or shall I remove the debug prints (which could be a bad idea for future debugging)?

Thanks
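One portable answer to the question above: anonymous variadic macros are a C99 feature, so a variadic *function* (valid since C89) sidesteps -Werror=variadic-macros entirely. A sketch, not the test suite's actual helper:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Debug print as a variadic function rather than a variadic macro,
   so it compiles even under -std=gnu90 -Werror=variadic-macros. */
static void debug_print (const char *fmt, ...)
{
    va_list args;
    va_start (args, fmt);
    vfprintf (stderr, fmt, args);
    va_end (args);
}

/* Same idea, formatting into a buffer; returns the formatted length. */
static int debug_format (char *out, size_t outsz, const char *fmt, ...)
{
    va_list args;
    va_start (args, fmt);
    int n = vsnprintf (out, outsz, fmt, args);
    va_end (args);
    return n;
}
```

The trade-off versus a macro is losing automatic __FILE__/__LINE__ capture, which a wrapper macro with fixed arguments can still provide without variadic syntax.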


f18m commented Sep 11, 2018

Moreover, I see that on OSX the trick of using "ipc://*" and then ZMQ_LAST_ENDPOINT seems to fail:

../../tests/test_hwm_pubsub.cpp:55:test_ipc:FAIL: zmq_getsockopt (pub_socket, ZMQ_LAST_ENDPOINT, pub_endpoint, &len) failed, errno = 22 (Invalid argument)

Am I missing something? Is this a bug in the ZMQ OSX implementation?


f18m commented Sep 13, 2018

Well, maybe I actually found the root cause of the memory leak: even if valgrind reports the leak on the PUB side, it is probably due to the SUB-side thread.
The fact is that the proxy_hwm test is using the inproc transport, so the message is not allocated on the SUB side...
I hope it's fixed now

TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (
subsocket, ZMQ_RCVTIMEO, &timeout_ms, sizeof (timeout_ms)));
} else {
TEST_ASSERT_SUCCESS_ERRNO (zmq_msg_close (&msg));
Member (inline review comment):

this close call needs to be in the other branch of the if, that's what's causing the leak

Contributor Author (f18m) replied:

You're right, my fault... I intended to put the close on the success path indeed.


bluca commented Sep 13, 2018

It's still there, see comment inline for the fix

@bluca bluca changed the title [WIP] Hwm tests Hwm tests Sep 13, 2018
@bluca bluca merged commit eb3e63e into zeromq:master Sep 13, 2018

bluca commented Sep 13, 2018

@f18m thanks for your PR, it looks good now, merged.

Could you please send another PR with a relicensing grant. See README here for details: https://github.com/zeromq/libzmq/tree/master/RELICENSE and this example: https://github.com/zeromq/libzmq/blob/master/RELICENSE/WenbinHou.md


bluca commented Sep 13, 2018

@f18m actually it's causing a failure to build with older gcc:

[  249s] 11 Tests 0 Failures 0 Ignored 
[  249s] OK
[  249s] PASS: tests/test_inproc_connect
[  249s] PASS: tests/test_issue_566
[  250s] PASS: tests/test_proxy
[  251s] *** longjmp causes uninitialized stack frame ***: /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm terminated
[  251s] ======= Backtrace: =========
[  251s] /lib/libc.so.6(__fortify_fail+0x4d)[0xf73e829d]
[  251s] /lib/libc.so.6(+0xfd20a)[0xf73e820a]
[  251s] /lib/libc.so.6(__longjmp_chk+0x49)[0xf73e8179]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm[0x804c2ec]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm[0x804a505]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/src/.libs/libzmq.so.5(+0x5ee93)[0xf7779e93]
[  251s] /lib/libpthread.so.0(+0x6bc9)[0xf75c9bc9]
[  251s] /lib/libc.so.6(clone+0x5e)[0xf73ce11e]
[  251s] ======= Memory map: ========
[  251s] (process memory map elided)
[  251s] :133::FAIL: Expression Evaluated To FALSE/bin/sh: line 5: 20905 Aborted                 ${dir}$tst
[  251s] FAIL: tests/test_proxy_hwm

Could you please have a look?

https://build.opensuse.org/package/live_build_log/network:messaging:zeromq:git-stable/libzmq/SLE_12_SP3/x86_64


f18m commented Sep 14, 2018

Hi @bluca,
I created the PR for relicensing.
Regarding the failing SLES 12 build: I tried to explore what's happening, but I guess I will need a build environment and a debug build. This is going to take me some time to set up.


f18m commented Sep 15, 2018

Hi @bluca ,
I think the latest SLES 12 build is successful:

https://build.opensuse.org/package/live_build_log/network:messaging:zeromq:git-stable/libzmq/SLE_12_SP3/x86_64

I also installed a VM with OpenSUSE 12.3, gcc 4.7.2 and tried to build zmq and run unit tests: everything was successful...


bluca commented Sep 15, 2018


bluca commented Sep 15, 2018

Try an ASAN build on your SUSE 12.3 VM, it might help identify the issue


bluca commented Sep 15, 2018

Actually, it also fails on Rawhide and Leap 15, so with much newer compilers. It looks to be a low-probability race causing a segfault. Probably a thread-unsafe access?


bluca commented Sep 15, 2018

Managed to reproduce on Ubuntu 14.04 (gcc 4.8) after ~10 runs in gdb:

Starting program: /home/luca/git/libzmq/tests/.libs/test_proxy_hwm 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff359b700 (LWP 12295)]
[New Thread 0x7ffff2a94700 (LWP 12296)]
[New Thread 0x7ffff1f8d700 (LWP 12297)]
[New Thread 0x7ffff1486700 (LWP 12298)]
[New Thread 0x7ffff097f700 (LWP 12299)]
[New Thread 0x7fffefe78700 (LWP 12300)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2a94700 (LWP 12296)]
__longjmp () at ../sysdeps/x86_64/__longjmp.S:67
67	../sysdeps/x86_64/__longjmp.S: No such file or directory.
(gdb) bt
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:67
#1  0xab498f7f657f36ae in ?? ()
Cannot access memory at address 0xab498f7f657f36ae
(gdb) info thread
  Id   Target Id         Frame 
  7    Thread 0x7fffefe78700 (LWP 12300) "ZMQbg/1" 0x00007ffff45e06d3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7ffff097f700 (LWP 12299) "ZMQbg/0" 0x00007ffff45e06d3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  5    Thread 0x7ffff1486700 (LWP 12298) "test_proxy_hwm" 0x00007ffff45a6d4d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7ffff1f8d700 (LWP 12297) "test_proxy_hwm" 0x00007ffff45a6d4d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:81
* 3    Thread 0x7ffff2a94700 (LWP 12296) "test_proxy_hwm" __longjmp () at ../sysdeps/x86_64/__longjmp.S:67
  2    Thread 0x7ffff359b700 (LWP 12295) "test_proxy_hwm" 0x00007ffff45d2c9d in poll ()
    at ../sysdeps/unix/syscall-template.S:81
  1    Thread 0x7ffff7fea780 (LWP 12294) "test_proxy_hwm" 0x00007ffff48b465b in pthread_join (
    threadid=140737264568064, thread_return=0x0) at pthread_join.c:92

It's the publisher thread in the test


bluca commented Sep 15, 2018

So it's the unity assert that triggers the segfault, for some reason... gdb can't recreate the stack trace after the segfault, but setting a breakpoint at __longjmp works:

Thread 3 "test_proxy_hwm" hit Breakpoint 1, __longjmp () at ../sysdeps/x86_64/__longjmp.S:29
29	../sysdeps/x86_64/__longjmp.S: No such file or directory.
(gdb) bt
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:29
#1  0x00007ffff68d1deb in __libc_siglongjmp (env=0x55555575c630 <Unity+80>, val=<optimized out>) at longjmp.c:38
#2  0x0000555555559b80 in UnityFail (msg=0x55555555a4b8 " Expression Evaluated To FALSE", line=133)
    at external/unity/unity.c:1310
#3  0x0000555555557030 in publisher_thread_main (pvoid=0x7fffffffddd0) at tests/test_proxy_hwm.cpp:133
#4  0x00007ffff7b7e358 in thread_routine (arg_=0x55555576f690) at src/thread.cpp:182
#5  0x00007ffff74e2494 in start_thread (arg=0x7ffff5c5c700) at pthread_create.c:333
#6  0x00007ffff6987acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

@sigiesec any idea why unity segfaults like that in TEST_ABORT when run in another thread? Do we need to tweak the build?
https://github.com/zeromq/libzmq/blob/master/external/unity/unity_internals.h#L591

@f18m the HWM assert is failing here:

https://github.com/zeromq/libzmq/blob/master/tests/test_proxy_hwm.cpp#L133

So if it didn't segfault, it would fail. It happens once in 10-20 runs. Compiler version doesn't seem to matter. Could you please check?


f18m commented Sep 16, 2018

Hi @bluca ,
thanks for your investigation.
I temporarily replaced the TEST_ASSERT() with a printf and indeed, running the test 30 times, I see that on line 133 the "send_count" variable sometimes takes an intermediate value between 20 and 40, i.e. it takes the value 30.

Is this the result of some race condition in the HWM "handshaking" internal to zmq::pipe_t?

In test_proxy_hwm the HWM values of the PUB and SUB sockets are set BEFORE calling zmq_connect(). The HWM values of the proxy XPUB and XSUB sockets are set BEFORE calling zmq_bind(). From my understanding, when the chain
PUB -> XSUB -> XPUB -> SUB
is built, the pipes should all accept a number = HWM = 10 of messages (or twice that number, since it looks like for the inproc transport the total number of messages that can be queued is computed as TX HWM + RX HWM).
That is why the publisher thread checks for 2 * HWM. It also accepts 4 * HWM because experimentally I found that most of the time 4 * HWM messages are queued as well (I guess due to the TX+RX doubling I mentioned). Now we have discovered that sometimes 3 * HWM messages are queued.

Is there some logic to why this happens?

Of course the workaround fix for the test is trivial and involves adding a 3 * HWM condition to:
L133: TEST_ASSERT (4 * HWM == send_count || 2 * HWM == send_count);
and later on the RX side as well:
TEST_ASSERT (4 * HWM == rxsuccess || 2 * HWM == rxsuccess);

Perhaps I could try adding a pthread_barrier_t to allow all sockets to handshake their HWM values before starting to send messages?

@sigiesec commented:

The unity assertions should be threadsafe AFAIK. Maybe there is some corruption before, which results in the segfault in the assertion?


bluca commented Sep 16, 2018

The unity assertions should be threadsafe AFAIK. Maybe there is some corruption before, which results in the segfault in the assertion?

@sigiesec the thread looks extremely simple; I can't see where that would happen, any idea? I'll switch to a standard assert for now

Perhaps I could try adding a pthread_barrier_t to allow all sockets to handshake their HWM values before starting sending messages?

@f18m I don't think that's the issue. They are finalised in a critical section, so there are already barriers in place:

https://github.com/zeromq/libzmq/blob/master/src/ctx.cpp#L570

I think there are broadly speaking 2 problems.

The way PUB is designed to work is that if there are no connections (no pipes), it will discard the message immediately, without even trying to queue it on the pipe, and thus the HWM never increments:

https://github.com/zeromq/libzmq/blob/master/src/dist.cpp#L164

The NODROP option was added much later (and IMHO it's a bit hacky...), and simply didn't consider this case. And I'm not sure it should either, but that's another story.

So on slow architectures or overworked builders the msleep is not enough, and the PUB thread starts and sends messages without having any subscriptions (but without EAGAIN either), so the counter increases massively. I've seen this a couple of times. To avoid that, there are either other socket types, or design patterns that use an out-of-band deterministic socket to synchronise (e.g. START/STOP messages), or, much simpler, use XPUB and wait to receive the subscription before starting to publish, which will be deterministic in the sense that it won't start until there is at least one listener, regardless of system load.

The second and most important issue is that you have 3 threads doing I/O across 2 queues. Depending on the scheduling, it might happen that 20, 30 or 40 messages go through before the pub blocks.
That's because the receiver thread gets kicked once every (hwm_ + 1) / 2 sent messages:

https://github.com/zeromq/libzmq/blob/master/src/pipe.cpp#L483

So depending on the scheduling of the second thread, the publisher might get one, two or three more batches in. The ceiling is 40, as there are 2 queues.

So the test should really check for a range - HWM * 2 * npipes.
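The arithmetic behind those numbers can be restated in a few helpers (illustrative only, not libzmq API): each inproc pipe admits roughly TX HWM + RX HWM messages, and the reader signals the writer back only once per low-watermark batch, so with HWM = 10 and two queues the publisher can observe anywhere between 20 and 40 successful sends before blocking:

```c
/* Messages one inproc pipe can hold: TX HWM + RX HWM. */
static int pipe_capacity (int hwm) { return 2 * hwm; }

/* The reader kicks the writer once per low-watermark batch,
   mirroring the (hwm_ + 1) / 2 computation in src/pipe.cpp. */
static int reader_batch (int hwm) { return (hwm + 1) / 2; }

/* Bounds on how many sends succeed before the publisher blocks:
   at least one pipe's worth, at most all pipes' worth. */
static int send_floor (int hwm) { return 2 * hwm; }
static int send_ceiling (int hwm, int npipes) { return 2 * hwm * npipes; }
```

With HWM = 10 these give a floor of 20 and a ceiling of 40, consistent with the 20/30/40 send counts observed in the test.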


f18m commented Sep 16, 2018

@bluca,
thanks for the investigation and the explanation.
I modified the test to address both problems encountered: the PUB/SUB sync (by using XPUB as you suggested) and also accepting 3 * HWM as a valid value.
See
#3254

With these changes the test ran fine 200 times on my system while it failed after 20-30 runs previously.


bluca commented Sep 16, 2018

I had opened #3253 but I'll rebase it instead. I think it's better to have a range, just to be sure.

There's also test_hwm_pubsub failing; even more rarely, but it still happens (it took ~50 runs in valgrind to reproduce on my desktop):

WARNING: Forced closure of 2 sockets, this is an implementation error unless the test case failed
tests/test_hwm_pubsub.cpp:150:test_tcp:FAIL: Expected 6000 Was 5000
tests/test_hwm_pubsub.cpp:264:test_inproc:PASS
tests/test_hwm_pubsub.cpp:266:test_ipc:PASS
tests/test_hwm_pubsub.cpp:268:test_reset_hwm:PASS

-----------------------
4 Tests 1 Failures 0 Ignored 
FAIL


f18m commented Sep 17, 2018

Yeah, I guess that's a race condition due to the msleep() call:

msleep (2 * SETTLE_TIME); // required for TCP transport

That raises the question: wherever I see msleep (SETTLE_TIME); in test code, I guess there's a possible race condition... isn't there?
I don't like that very much, but if that msleep() is the way e.g. connection race conditions are handled all over the unit tests, we may simply increase that 2* to something like 10* to further reduce the chance of the test failing...
An alternative "fix" could be to increase the RX timeout on the sub_socket...


bluca commented Sep 17, 2018

The right solution is to synchronize, which is what is commonly done in production. In that case it looks like you can do the same as I did for the other test: switch to XPUB and don't start sending until you get the subscription message.

@sigiesec commented:

I think these are not race conditions in a strict sense, but timing dependencies. There are cases where the sleep is used as a timeout, assuming that if something does not happen within SETTLE_TIME, it will never happen. On slow machines, this may lead to false positives, in the sense that the "something" happens, but only after the timeout.

Other cases expect that something happens, assuming that if it ever happens, it will happen within SETTLE_TIME. On slow machines, this leads to false negatives, i.e. a test case fails but the "something" would have happened after the timeout.

By increasing SETTLE_TIME, the probability of such false positives and negatives can be reduced, but not eliminated.

However, increasing SETTLE_TIME will also increase the test execution time and therefore slow down feedback from the CI jobs.

In some cases, the test code is probably suboptimal: sleeping for a fixed time could be replaced by an operation with a timeout (in the second case), and in others by explicitly waiting for the negated event (in the first case), the latter possibly requiring some extension of libzmq.
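The "operation with a timeout" idea can be sketched as a bounded poll. This is only an illustration of the principle; wait_until, third_poll and the step sizes are hypothetical, not libzmq test helpers:

```c
#include <time.h>

/* Sleep for the given number of milliseconds. */
static void sleep_ms (int ms)
{
    struct timespec ts = {ms / 1000, (long) (ms % 1000) * 1000000L};
    nanosleep (&ts, NULL);
}

/* Poll `cond` every `step_ms` until it holds or `timeout_ms` elapses.
   Returns 1 if the condition held, 0 on timeout: a bounded wait that
   finishes early, instead of a fixed msleep (SETTLE_TIME). */
static int wait_until (int (*cond) (void *), void *arg, int timeout_ms,
                       int step_ms)
{
    int waited = 0;
    while (!cond (arg)) {
        if (waited >= timeout_ms)
            return 0;
        sleep_ms (step_ms);
        waited += step_ms;
    }
    return 1;
}

/* Example conditions for demonstration. */
static int third_poll (void *p) { return ++*(int *) p >= 3; }
static int never (void *p) { (void) p; return 0; }
```

A fast machine exits on the first poll; a slow one still gets the full timeout budget, which addresses both failure modes described above.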


f18m commented Sep 18, 2018

Hi @sigiesec ,

By increasing SETTLE_TIME, the probability of such false positives and negatives could be reduced, but not eliminated.

This really sounds like a race condition (between the zeromq background threads and the test's main thread)... anyway, my intent is not to point out this aspect, just to understand.

@bluca ,

The right solution is to synchronize, which is what is commonly done in production - that case looks like you can do the same as I did for the other test, switch to XPUB and don't start sending until you get the subscription message

Yeah, that looks like a good solution as well... I can try that in the next few days (probably not tomorrow)


f18m commented Sep 20, 2018

Hi @bluca
I created #3255 to also fix test_hwm_pubsub. However, even after running that test hundreds of times, I could not reproduce the failure on my desktop machine.
Of course I also ran it hundreds of times AFTER applying my changes, and all runs completed OK


f18m commented Sep 20, 2018

One very last issue: I cannot see the updated docs online at:
http://api.zeromq.org/master:zmq-setsockopt

Moreover, there is some discrepancy: on GitHub I see version 4.2.5 available, but the website lists version 4.2.3 as the latest one (and the latest stable docs are from 4.2.3 as well)...


bluca commented Sep 20, 2018

Yes the website is not updated automatically

MohammadAlTurany pushed a commit to FairRootGroup/libzmq that referenced this pull request Jan 28, 2019
Add new HWM tests and more detailed documentation
@f18m f18m deleted the hwm_tests branch September 1, 2019 21:15