
Hwm tests #3242

Merged
merged 23 commits into from
Sep 13, 2018

Conversation


@f18m f18m commented Sep 4, 2018

Problem: correct HWM handling is checked only on INPROC transport
Problem: the ZMQ proxy control socket is unusable when the proxy reaches its XPUB HWM and lossy flag is set.

Solution:

  • extend the existing HWM unit test to also cover the TCP and IPC transports
  • add a new unit test that shows that the ZMQ proxy control socket becomes unusable when the proxy reaches its XPUB HWM with the lossy flag set, and how to cope with that

Note: this PR is still work-in-progress. In particular, I need to think about how to organize the new unit test on the proxy; currently it requires human inspection to understand what's going on.

// send 1000 msg on hwm 1000, receive 1000
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000));
// send 1000 msg on hwm 1000, receive 1000, on TCP transport
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, ENDPOINT_0));
Member (inline review comment):

This will cause tests that run in parallel to fail; please use "tcp://127.0.0.1:*" and change the code to use the _ENDPOINT socket option


bluca commented Sep 4, 2018

Please also add the new test to tests/CMakeLists.txt

@bluca bluca changed the title Hwm tests [WIP] Hwm tests Sep 4, 2018

f18m commented Sep 8, 2018

Hi @bluca,
I modified "test_hwm_pubsub" to use a random port for TCP.
I added "test_proxy_hwm" to CMake and also cleaned it up. I added comments at the beginning of that source file to explain its purpose.

I'll keep an eye on the automatic AppVeyor and Travis CI checks!

Thanks


TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "tcp://127.0.0.1:*"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "inproc://a"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester"));
Member (inline review comment):

IPC does not work on Windows, so this will fail; this line and the other one need to be wrapped in #ifndef ZMQ_HAVE_WINDOWS
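The requested guard can be sketched like this (ZMQ_HAVE_WINDOWS is the real libzmq build define; TEST_IPC_SUPPORTED and run_transport_tests are illustrative names for this sketch only):

```c
/* libzmq defines ZMQ_HAVE_WINDOWS on Windows builds, where the ipc://
   transport is unavailable, so the IPC test variant is compiled out. */
#ifdef ZMQ_HAVE_WINDOWS
#define TEST_IPC_SUPPORTED 0
#else
#define TEST_IPC_SUPPORTED 1
#endif

/* Counts how many transport variants this build would exercise. */
static int run_transport_tests (void)
{
    int tests_run = 0;
    tests_run++; /* tcp variant: runs everywhere */
    tests_run++; /* inproc variant: runs everywhere */
#if TEST_IPC_SUPPORTED
    tests_run++; /* ipc variant: skipped on Windows */
#endif
    return tests_run;
}
```

On non-Windows platforms all three variants run; on Windows the IPC one is simply not compiled.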


f18m commented Sep 8, 2018

Hi @bluca,
I disabled the IPC test on Windows and also tried to address some shortcomings in the docs.

@@ -854,11 +858,13 @@ If this limit has been reached the socket shall enter an exceptional state and
depending on the socket type, 0MQ shall take appropriate action such as
blocking or dropping sent messages. Refer to the individual socket descriptions
in linkzmq:zmq_socket[3] for details on the exact action taken for each socket
type.
type.
Member (inline review comment):

please remove this extra whitespace


bluca commented Sep 8, 2018

The HWM tests are failing on Travis, could you please check? Thanks


f18m commented Sep 9, 2018

I checked, and some of the failures are related to the TCP transport test cases failing due to timing issues... I moved a check to after a wait of SETTLE_TIME ms to improve test reliability.

Other builds are failing due to clang-format issues about source code formatting... do you have any hint on how to fix these? Thanks!


bluca commented Sep 9, 2018

There's only one clang-format job, and it prints the diff that needs to be applied; you can also run make format-check

But there's also the IPC test failing to bind on OSX


void test_ipc ()
{
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester1"));
Member (inline review comment):

The issue on OSX is that this is trying to use the abstract namespace, which, as the zmq_ipc manpage says, is Linux-only. You can use /tmp, or a wildcard ipc://* together with the _endpoint call.

Contributor Author (f18m) replied:

ok, should be fixed now


f18m commented Sep 11, 2018

@bluca : is there some special macro to emit debugging prints from a unit test source file?
I was using a custom macro taking a variadic argument list in test_proxy_hwm.cpp, but I see some Travis builds fail with:

../tests/test_proxy_hwm.cpp:63:23: error: anonymous variadic macros were introduced in C99 [-Werror=variadic-macros]

So shall I use printf() directly? Or shall I remove the debug prints (which could be a bad idea for future debugging)?

Thanks
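One portable answer to the question above: anonymous variadic macros are a C99 feature, so a variadic *function* (valid since C89) sidesteps -Werror=variadic-macros entirely. A sketch, not the test suite's actual helper:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Debug print as a variadic function rather than a variadic macro,
   so it compiles even under -std=gnu90 -Werror=variadic-macros. */
static void debug_print (const char *fmt, ...)
{
    va_list args;
    va_start (args, fmt);
    vfprintf (stderr, fmt, args);
    va_end (args);
}

/* Same idea, formatting into a buffer; returns the formatted length. */
static int debug_format (char *out, size_t outsz, const char *fmt, ...)
{
    va_list args;
    va_start (args, fmt);
    int n = vsnprintf (out, outsz, fmt, args);
    va_end (args);
    return n;
}
```

The trade-off versus a macro is losing automatic __FILE__/__LINE__ capture, which a wrapper macro with fixed arguments can still provide without variadic syntax.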


f18m commented Sep 11, 2018

Moreover, I see that on OSX the trick of using "ipc://*" and then ZMQ_LAST_ENDPOINT seems to fail:

../../tests/test_hwm_pubsub.cpp:55:test_ipc:FAIL: zmq_getsockopt (pub_socket, ZMQ_LAST_ENDPOINT, pub_endpoint, &len) failed, errno = 22 (Invalid argument)

Am I missing something? Is this a bug in the ZMQ OSX implementation?


f18m commented Sep 13, 2018

Well, maybe I actually found the root cause of the memory leak: even if valgrind reports the leak on the PUB side, it is probably due to the SUB-side thread.
The fact is that the proxy_hwm test is using the inproc transport, so the message is not allocated on the SUB side...
I hope it's fixed now

TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (
subsocket, ZMQ_RCVTIMEO, &timeout_ms, sizeof (timeout_ms)));
} else {
TEST_ASSERT_SUCCESS_ERRNO (zmq_msg_close (&msg));
Member (inline review comment):

this close call needs to be in the other branch of the if, that's what's causing the leak

Contributor Author (f18m) replied:

You're right, my fault... I intended to put the close on the success path indeed.


bluca commented Sep 13, 2018

It's still there, see comment inline for the fix

@bluca bluca changed the title [WIP] Hwm tests Hwm tests Sep 13, 2018
@bluca bluca merged commit eb3e63e into zeromq:master Sep 13, 2018

bluca commented Sep 13, 2018

@f18m thanks for your PR, it looks good now, merged.

Could you please send another PR with a relicensing grant. See README here for details: https://github.com/zeromq/libzmq/tree/master/RELICENSE and this example: https://github.com/zeromq/libzmq/blob/master/RELICENSE/WenbinHou.md


bluca commented Sep 13, 2018

@f18m actually it's causing a failure to build with older gcc:

[  249s] 11 Tests 0 Failures 0 Ignored 
[  249s] OK
[  249s] PASS: tests/test_inproc_connect
[  249s] PASS: tests/test_issue_566
[  250s] PASS: tests/test_proxy
[  251s] *** longjmp causes uninitialized stack frame ***: /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm terminated
[  251s] ======= Backtrace: =========
[  251s] /lib/libc.so.6(__fortify_fail+0x4d)[0xf73e829d]
[  251s] /lib/libc.so.6(+0xfd20a)[0xf73e820a]
[  251s] /lib/libc.so.6(__longjmp_chk+0x49)[0xf73e8179]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm[0x804c2ec]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/tests/.libs/lt-test_proxy_hwm[0x804a505]
[  251s] /home/abuild/rpmbuild/BUILD/zeromq-4.2.5+git20180913/src/.libs/libzmq.so.5(+0x5ee93)[0xf7779e93]
[  251s] /lib/libpthread.so.0(+0x6bc9)[0xf75c9bc9]
[  251s] /lib/libc.so.6(clone+0x5e)[0xf73ce11e]
[  251s] ======= Memory map: ========
[  251s] (process memory map elided)
[  251s] :133::FAIL: Expression Evaluated To FALSE/bin/sh: line 5: 20905 Aborted                 ${dir}$tst
[  251s] FAIL: tests/test_proxy_hwm

Could you please have a look?

https://build.opensuse.org/package/live_build_log/network:messaging:zeromq:git-stable/libzmq/SLE_12_SP3/x86_64


f18m commented Sep 14, 2018

Hi @bluca,
I created the PR for relicensing.
Regarding the failing SLES 12 build: I tried to explore what's happening, but I guess I will need a build environment and a debug build. This is going to take me some time to set up.


f18m commented Sep 15, 2018

Hi @bluca ,
I think the latest SLES 12 build is successful:

https://build.opensuse.org/package/live_build_log/network:messaging:zeromq:git-stable/libzmq/SLE_12_SP3/x86_64

I also installed a VM with OpenSUSE 12.3, gcc 4.7.2 and tried to build zmq and run unit tests: everything was successful...


bluca commented Sep 15, 2018


bluca commented Sep 15, 2018

Try an ASAN build on your SUSE 12.3 VM, it might help identify the issue


bluca commented Sep 15, 2018

Actually, it also fails on Rawhide and Leap 15, so with much newer compilers. It looks to be a low-probability race causing a segfault. Probably a thread-unsafe access?


bluca commented Sep 15, 2018

Managed to reproduce on Ubuntu 14.04 (gcc 4.8) after ~10 runs in gdb:

Starting program: /home/luca/git/libzmq/tests/.libs/test_proxy_hwm 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff359b700 (LWP 12295)]
[New Thread 0x7ffff2a94700 (LWP 12296)]
[New Thread 0x7ffff1f8d700 (LWP 12297)]
[New Thread 0x7ffff1486700 (LWP 12298)]
[New Thread 0x7ffff097f700 (LWP 12299)]
[New Thread 0x7fffefe78700 (LWP 12300)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2a94700 (LWP 12296)]
__longjmp () at ../sysdeps/x86_64/__longjmp.S:67
67	../sysdeps/x86_64/__longjmp.S: No such file or directory.
(gdb) bt
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:67
#1  0xab498f7f657f36ae in ?? ()
Cannot access memory at address 0xab498f7f657f36ae
(gdb) info thread
  Id   Target Id         Frame 
  7    Thread 0x7fffefe78700 (LWP 12300) "ZMQbg/1" 0x00007ffff45e06d3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7ffff097f700 (LWP 12299) "ZMQbg/0" 0x00007ffff45e06d3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  5    Thread 0x7ffff1486700 (LWP 12298) "test_proxy_hwm" 0x00007ffff45a6d4d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7ffff1f8d700 (LWP 12297) "test_proxy_hwm" 0x00007ffff45a6d4d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:81
* 3    Thread 0x7ffff2a94700 (LWP 12296) "test_proxy_hwm" __longjmp () at ../sysdeps/x86_64/__longjmp.S:67
  2    Thread 0x7ffff359b700 (LWP 12295) "test_proxy_hwm" 0x00007ffff45d2c9d in poll ()
    at ../sysdeps/unix/syscall-template.S:81
  1    Thread 0x7ffff7fea780 (LWP 12294) "test_proxy_hwm" 0x00007ffff48b465b in pthread_join (
    threadid=140737264568064, thread_return=0x0) at pthread_join.c:92

It's the publisher thread in the test


bluca commented Sep 15, 2018

So it's the unity assert that triggers the segfault, for some reason... gdb can't recreate the stack trace after the segfault, but setting a breakpoint at __longjmp works:

Thread 3 "test_proxy_hwm" hit Breakpoint 1, __longjmp () at ../sysdeps/x86_64/__longjmp.S:29
29	../sysdeps/x86_64/__longjmp.S: No such file or directory.
(gdb) bt
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:29
#1  0x00007ffff68d1deb in __libc_siglongjmp (env=0x55555575c630 <Unity+80>, val=<optimized out>) at longjmp.c:38
#2  0x0000555555559b80 in UnityFail (msg=0x55555555a4b8 " Expression Evaluated To FALSE", line=133)
    at external/unity/unity.c:1310
#3  0x0000555555557030 in publisher_thread_main (pvoid=0x7fffffffddd0) at tests/test_proxy_hwm.cpp:133
#4  0x00007ffff7b7e358 in thread_routine (arg_=0x55555576f690) at src/thread.cpp:182
#5  0x00007ffff74e2494 in start_thread (arg=0x7ffff5c5c700) at pthread_create.c:333
#6  0x00007ffff6987acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

@sigiesec any idea why unity segfaults like that in TEST_ABORT when run in another thread? Do we need to tweak the build?
https://github.com/zeromq/libzmq/blob/master/external/unity/unity_internals.h#L591

@f18m the HWM assert is failing here:

https://github.com/zeromq/libzmq/blob/master/tests/test_proxy_hwm.cpp#L133

So if it didn't segfault, it would fail. It happens once in 10-20 runs. Compiler version doesn't seem to matter. Could you please check?


f18m commented Sep 16, 2018

Hi @bluca ,
thanks for your investigation.
I temporarily replaced the TEST_ASSERT() with a printf and indeed, running the test 30 times, I see that on line 133 the "send_count" variable sometimes takes an intermediate value between 20 and 40, i.e. it takes the value 30.

Is this the result of some race condition in the HWM "handshaking" internal to zmq::pipe_t?

In test_proxy_hwm the HWM values of the PUB and SUB sockets are set BEFORE calling zmq_connect(). The HWM values of the proxy XPUB and XSUB sockets are set BEFORE calling zmq_bind(). From my understanding, when the chain
PUB -> XSUB -> XPUB -> SUB
is built, the pipes should all accept a number = HWM = 10 of messages (or twice that number, since it looks like for the inproc transport the total number of messages that can be queued is computed as TX HWM + RX HWM).
That is why the publisher thread checks for 2 * HWM. It also accepts 4 * HWM because experimentally I found that most of the time 4 * HWM messages are queued as well (I guess due to the TX+RX doubling I mentioned). Now we have discovered that sometimes 3 * HWM messages are queued.

Is there some logic to why this happens?

Of course the workaround fix for the test is trivial and involves adding a 3 * HWM condition to:
L133: TEST_ASSERT (4 * HWM == send_count || 2 * HWM == send_count);
and later on the RX side as well:
TEST_ASSERT (4 * HWM == rxsuccess || 2 * HWM == rxsuccess);

Perhaps I could try adding a pthread_barrier_t to allow all sockets to handshake their HWM values before starting to send messages?

@sigiesec commented:

The unity assertions should be threadsafe AFAIK. Maybe there is some corruption before, which results in the segfault in the assertion?


bluca commented Sep 16, 2018

The unity assertions should be threadsafe AFAIK. Maybe there is some corruption before, which results in the segfault in the assertion?

@sigiesec the thread looks extremely simple; I can't see where that would happen, any idea? I'll switch to a standard assert for now

Perhaps I could try adding a pthread_barrier_t to allow all sockets to handshake their HWM values before starting sending messages?

@f18m I don't think that's the issue. They are finalised in a critical section, so there are already barriers in place:

https://github.com/zeromq/libzmq/blob/master/src/ctx.cpp#L570

I think there are broadly speaking 2 problems.

The way PUB is designed to work is that if there are no connections (no pipes), it will discard the message immediately, without even trying to queue it on the pipe, and thus the HWM never increments:

https://github.com/zeromq/libzmq/blob/master/src/dist.cpp#L164

The NODROP option was added much later (and IMHO it's a bit hacky...), and simply didn't consider this case. And I'm not sure it should either, but that's another story.

So on slow architectures or overworked builders the msleep is not enough, and the PUB thread starts and sends messages without having any subscriptions (but without EAGAIN either), so the counter increases massively. I've seen this a couple of times. To avoid that, there are either other socket types, or design patterns that use an out-of-band deterministic socket to synchronise (e.g. START/STOP messages), or, much simpler, use XPUB and wait to receive the subscription before starting to publish, which will be deterministic in the sense that it won't start until there is at least one listener, regardless of system load.

The second and most important issue is that you have 3 threads doing I/O across 2 queues. Depending on the scheduling, it might happen that 20, 30 or 40 messages go through before the pub blocks.
That's because the receiver thread gets kicked once every (hwm_ + 1) / 2 sent messages:

https://github.com/zeromq/libzmq/blob/master/src/pipe.cpp#L483

So depending on the scheduling of the second thread, the publisher might get one, two or three more batches in. The ceiling is 40, as there are 2 queues.

So the test should really check for a range - HWM * 2 * npipes.
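The arithmetic behind those numbers can be restated in a few helpers (illustrative only, not libzmq API): each inproc pipe admits roughly TX HWM + RX HWM messages, and the reader signals the writer back only once per low-watermark batch, so with HWM = 10 and two queues the publisher can observe anywhere between 20 and 40 successful sends before blocking:

```c
/* Messages one inproc pipe can hold: TX HWM + RX HWM. */
static int pipe_capacity (int hwm) { return 2 * hwm; }

/* The reader kicks the writer once per low-watermark batch,
   mirroring the (hwm_ + 1) / 2 computation in src/pipe.cpp. */
static int reader_batch (int hwm) { return (hwm + 1) / 2; }

/* Bounds on how many sends succeed before the publisher blocks:
   at least one pipe's worth, at most all pipes' worth. */
static int send_floor (int hwm) { return 2 * hwm; }
static int send_ceiling (int hwm, int npipes) { return 2 * hwm * npipes; }
```

With HWM = 10 these give a floor of 20 and a ceiling of 40, consistent with the 20/30/40 send counts observed in the test.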


f18m commented Sep 16, 2018

@bluca,
thanks for the investigation and the explanation.
I modified the test to address both problems encountered: the PUB/SUB sync (by using XPUB as you suggested) and also accepting 3 * HWM as a valid value.
See
#3254

With these changes the test ran fine 200 times on my system while it failed after 20-30 runs previously.


bluca commented Sep 16, 2018

I had opened #3253 but I'll rebase it instead. I think it's better to have a range, just to be sure.

There's also test_hwm_pubsub failing; even more rarely, but it still happens (it took ~50 runs in valgrind to reproduce on my desktop):

WARNING: Forced closure of 2 sockets, this is an implementation error unless the test case failed
tests/test_hwm_pubsub.cpp:150:test_tcp:FAIL: Expected 6000 Was 5000
tests/test_hwm_pubsub.cpp:264:test_inproc:PASS
tests/test_hwm_pubsub.cpp:266:test_ipc:PASS
tests/test_hwm_pubsub.cpp:268:test_reset_hwm:PASS

-----------------------
4 Tests 1 Failures 0 Ignored 
FAIL


f18m commented Sep 17, 2018

Yeah, I guess that's a race condition due to the msleep() call:

msleep (2 * SETTLE_TIME); // required for TCP transport

That raises the question: wherever I see msleep (SETTLE_TIME); in test code, I guess there's a possible race condition... isn't there?
I don't like that very much, but if that msleep() is the way e.g. connection race conditions are handled all over the unit tests, we may simply increase that 2* to something like 10* to further reduce the chance of the test failing...
An alternative "fix" could be to increase the RX timeout on the sub_socket...


bluca commented Sep 17, 2018

The right solution is to synchronize, which is what is commonly done in production. In that case it looks like you can do the same as I did for the other test: switch to XPUB and don't start sending until you get the subscription message.

@sigiesec commented:

I think these are not race conditions in a strict sense, but timing dependencies. There are cases where the sleep is used as a timeout, assuming that if something does not happen within SETTLE_TIME, it will never happen. On slow machines, this may lead to false positives, in the sense that the "something" happens, but only after the timeout.

Other cases expect that something happens, assuming that if it ever happens, it will happen within SETTLE_TIME. On slow machines, this leads to false negatives, i.e. a test case fails but the "something" would have happened after the timeout.

By increasing SETTLE_TIME, the probability of such false positives and negatives can be reduced, but not eliminated.

However, increasing SETTLE_TIME will also increase the test execution time and therefore slow down feedback from the CI jobs.

In some cases, the test code is probably suboptimal: sleeping for a fixed time could be replaced by an operation with a timeout (in the second case), and in others by explicitly waiting for the negated event (in the first case), the latter possibly requiring some extension of libzmq.
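The "operation with a timeout" idea can be sketched as a bounded poll. This is only an illustration of the principle; wait_until, third_poll and the step sizes are hypothetical, not libzmq test helpers:

```c
#include <time.h>

/* Sleep for the given number of milliseconds. */
static void sleep_ms (int ms)
{
    struct timespec ts = {ms / 1000, (long) (ms % 1000) * 1000000L};
    nanosleep (&ts, NULL);
}

/* Poll `cond` every `step_ms` until it holds or `timeout_ms` elapses.
   Returns 1 if the condition held, 0 on timeout: a bounded wait that
   finishes early, instead of a fixed msleep (SETTLE_TIME). */
static int wait_until (int (*cond) (void *), void *arg, int timeout_ms,
                       int step_ms)
{
    int waited = 0;
    while (!cond (arg)) {
        if (waited >= timeout_ms)
            return 0;
        sleep_ms (step_ms);
        waited += step_ms;
    }
    return 1;
}

/* Example conditions for demonstration. */
static int third_poll (void *p) { return ++*(int *) p >= 3; }
static int never (void *p) { (void) p; return 0; }
```

A fast machine exits on the first poll; a slow one still gets the full timeout budget, which addresses both failure modes described above.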


f18m commented Sep 18, 2018

Hi @sigiesec ,

By increasing SETTLE_TIME, the probability of such false positives and negatives could be reduced, but not eliminated.

This really sounds like a race condition (between the zeromq background threads and the test's main thread)... anyway, my intent is not to point out this aspect, just to understand.

@bluca ,

The right solution is to synchronize, which is what is commonly done in production - that case looks like you can do the same as I did for the other test, switch to XPUB and don't start sending until you get the subscription message

Yeah, that looks like a good solution as well... I can try that in the next few days (probably not tomorrow)


f18m commented Sep 20, 2018

Hi @bluca
I created #3255 to also fix test_hwm_pubsub. However, even after running that test hundreds of times, I could not reproduce the failure on my desktop machine.
Of course I also ran it hundreds of times AFTER applying my changes, and all runs completed OK


f18m commented Sep 20, 2018

One very last issue: I cannot see the updated docs online at:
http://api.zeromq.org/master:zmq-setsockopt

Moreover, there is some discrepancy: on GitHub I see version 4.2.5 available, but the website lists version 4.2.3 as the latest one (and the latest stable docs are from 4.2.3 as well)...


bluca commented Sep 20, 2018

Yes the website is not updated automatically

MohammadAlTurany pushed a commit to FairRootGroup/libzmq that referenced this pull request Jan 28, 2019
Add new HWM tests and more detailed documentation
@f18m f18m deleted the hwm_tests branch September 1, 2019 21:15