Hwm tests #3242
…avoid having asker thread blocked
tests/test_hwm_pubsub.cpp
// send 1000 msg on hwm 1000, receive 1000
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000));
// send 1000 msg on hwm 1000, receive 1000, on TCP transport
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, ENDPOINT_0));
This will cause tests that run in parallel to fail. Please use "tcp://127.0.0.1:*" and change the code to read the bound address back via the ZMQ_LAST_ENDPOINT socket option.
Please also add the new test to tests/CMakeLists.txt
… new proxy HWM test also to CMake
…ting 2*HWM messages are received
Hi @bluca, I'll keep an eye on the automatic AppVeyor and Travis CI checks! Thanks
tests/test_hwm_pubsub.cpp
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "tcp://127.0.0.1:*"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "inproc://a"));
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester"));
IPC does not work on Windows, so this will fail there; this line and the other IPC one need to be wrapped in #ifndef ZMQ_HAVE_WINDOWS
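The guard requested above could look roughly like the sketch below. It is standalone and deliberately does not include libzmq headers: ZMQ_HAVE_WINDOWS is assumed to be provided by libzmq's build configuration, so in this snippet it is only defined if passed on the compiler command line. The function name endpoints_under_test and the /tmp path are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the list of endpoints a transport-parameterised
// HWM test could iterate over. The IPC entry is compiled out when
// ZMQ_HAVE_WINDOWS is defined, since IPC is not available there.
static std::vector<std::string> endpoints_under_test ()
{
    std::vector<std::string> endpoints;
    endpoints.push_back ("tcp://127.0.0.1:*");
    endpoints.push_back ("inproc://a");
#ifndef ZMQ_HAVE_WINDOWS
    // Hypothetical filesystem path, chosen only for illustration.
    endpoints.push_back ("ipc:///tmp/hwm-test-sketch.ipc");
#endif
    return endpoints;
}
```

Keeping the platform check in one place means each test body stays free of preprocessor conditionals.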
Hi @bluca,
doc/zmq_setsockopt.txt
@@ -854,11 +858,13 @@ If this limit has been reached the socket shall enter an exceptional state and
depending on the socket type, 0MQ shall take appropriate action such as
blocking or dropping sent messages. Refer to the individual socket descriptions
in linkzmq:zmq_socket[3] for details on the exact action taken for each socket
type.
type.
Please remove this extra whitespace.
The HWM tests are failing on Travis; could you please check? Thanks
…st to avoid failures on TravisCI / appVeyor
…hole transmission, after SETTLE_TIME
I checked, and some of the failures are TCP transport test cases failing due to timing issues... I moved a check to after a wait of SETTLE_TIME ms to improve test reliability. Other builds are failing due to clang issues about source code formatting... do you have any hint on how to fix these? Thanks!
There's only one clang-format job, and it prints the diff that needs to be applied; you can also run make format-check. But there's also the IPC test failing to bind on OSX.
tests/test_hwm_pubsub.cpp
void test_ipc ()
{
TEST_ASSERT_EQUAL_INT (1000, test_defaults (1000, 1000, "ipc://@tmp-tester1"));
The issue on OSX is that this is trying to use the abstract namespace, which, as the zmq_ipc manpage says, is Linux-only. You can use /tmp or a wildcard ipc://* and then read the endpoint back with ZMQ_LAST_ENDPOINT.
ok, should be fixed now
…'t use abstract IPC endpoints since they don't work on iOS
@bluca: is there some special macro to emit debugging prints from a unit test source file? I get:
../tests/test_proxy_hwm.cpp:63:23: error: anonymous variadic macros were introduced in C99 [-Werror=variadic-macros]
So shall I use printf() directly? Or shall I remove the debug prints (this could be a bad idea for future debugging)? Thanks
Moreover, I see that on OSX the trick of using "ipc://*" and then ZMQ_LAST_ENDPOINT seems to fail:
../../tests/test_hwm_pubsub.cpp:55:test_ipc:FAIL: zmq_getsockopt (pub_socket, ZMQ_LAST_ENDPOINT, pub_endpoint, &len) failed, errno = 22 (Invalid argument)
Am I missing something? Is this a bug in the ZMQ OSX implementation?
Well, maybe I actually found the root cause of the memleak: even if valgrind reports the leak on the PUB side, it is probably due to the SUB side thread.
tests/test_proxy_hwm.cpp
TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (
subsocket, ZMQ_RCVTIMEO, &timeout_ms, sizeof (timeout_ms)));
} else {
TEST_ASSERT_SUCCESS_ERRNO (zmq_msg_close (&msg));
this close call needs to be in the other branch of the if, that's what's causing the leak
You're right, my fault... I did intend to put the close on the success path.
It's still there, see the inline comment for the fix.
@f18m thanks for your PR, it looks good now, merged. Could you please send another PR with a relicensing grant? See the README here for details: https://github.com/zeromq/libzmq/tree/master/RELICENSE and this example: https://github.com/zeromq/libzmq/blob/master/RELICENSE/WenbinHou.md
@f18m actually it's causing a failure to build with older gcc:
Could you please have a look?
Hi @bluca,
Hi @bluca , I also installed a VM with OpenSUSE 12.3, gcc 4.7.2 and tried to build zmq and run unit tests: everything was successful...
Yeah but it fails plenty of times on old distros: https://build.opensuse.org/package/live_build_log/network:messaging:zeromq:git-stable/libzmq/CentOS_6/i586 So there is a race condition somewhere that might or might not trigger. It either needs to be fixed, or the commit reverted.
Try an ASAN build on your SUSE 12.3 VM, it might help identify the issue
Actually fails also on Rawhide and Leap 15, so with much newer compilers. It looks to be a low-chance race causing a segfault. Probably thread unsafe access?
Managed to reproduce on Ubuntu 14.04 (gcc 4.8) after ~10 runs in gdb:
It's the publisher thread in the test |
So it's the unity assert that triggers the segfault, for some reason... gdb can't recreate the stack trace after the segfault, but setting a breakpoint at __longjmp works:
@sigiesec any idea why unity segfaults like that in TEST_ABORT when run in another thread? Do we need to tweak the build? @f18m the HWM assert is failing here: https://github.com/zeromq/libzmq/blob/master/tests/test_proxy_hwm.cpp#L133 So if it didn't segfault, it would fail. It happens once in 10-20 runs. Compiler version doesn't seem to matter. Could you please check?
Hi @bluca , Is this the result of some race condition in HWM "handshaking" internal to zmq::pipe_t ?
In the test_proxy_hwm the HWM values of PUB and SUB sockets are set BEFORE calling zmq_connect(). The HWM values of the proxy XPUB and XSUB sockets are set BEFORE calling zmq_bind(). From my understanding when the chain
Is there some logic on why this happens? Of course the workaround fix for the test is trivial and involves adding 3*HWM condition to:
Perhaps I could try adding a pthread_barrier_t to allow all sockets to handshake their HWM values before starting sending messages?
The unity assertions should be threadsafe AFAIK. Maybe there is some corruption before, which results in the segfault in the assertion?
@sigiesec the thread looks extremely simple, can't see where that would happen, any idea? I'll switch to standard assert for now
@f18m I don't think that's the issue. They are finalised in a critical section, so there are already barriers in place: https://github.com/zeromq/libzmq/blob/master/src/ctx.cpp#L570

I think there are, broadly speaking, 2 problems.

First, the way PUB is designed to work is that if there are no connections (no pipes), it discards the message immediately, without even trying to queue it on the pipe, and thus the HWM never increments: https://github.com/zeromq/libzmq/blob/master/src/dist.cpp#L164 The NODROP option was added much later (and IMHO it's a bit hacky...), and simply didn't consider this case. And I'm not sure it should either, but that's another story. So on slow architectures or overworked builders the msleep is not enough, and the PUB thread starts and sends messages without having any subscriptions (but without EAGAIN either), so the counter increases massively. I've seen this a couple of times. To make the start deterministic there are either other socket types, or design patterns that use an out-of-band deterministic socket to synchronise (e.g. START/STOP messages), or, much simpler, use XPUB and wait to receive the subscription before starting to publish, which will be deterministic in the sense that it won't start until there is at least one listener, regardless of system load.

The second and most important issue is that you have 3 threads doing I/O across 2 queues. Depending on the scheduling, it might happen that 20, 30 or 40 messages go through before the pub blocks. https://github.com/zeromq/libzmq/blob/master/src/pipe.cpp#L483 So depending on the scheduling of the second thread, the publisher might get one, two or three more batches in. The ceiling is 40 as there are 2 queues. So the test should really check for a range, up to HWM * 2 * npipes.
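The range check suggested at the end of the comment above can be sketched as plain arithmetic. The upper bound HWM * 2 * npipes comes from the comment; the lower bound of one HWM and both function names are assumptions of this sketch, not what the merged test necessarily does. With an assumed HWM of 10 and 2 pipes, the ceiling works out to the 40 quoted above.

```cpp
#include <cassert>

// Upper bound taken from the discussion: HWM * 2 * npipes messages may get
// through before the publisher blocks.
static int hwm_ceiling (int hwm, int npipes)
{
    return hwm * 2 * npipes;
}

// Accept any received count inside [hwm, hwm_ceiling] instead of requiring
// one exact value. The lower bound `hwm` is an assumption of this sketch.
static bool received_count_plausible (int received, int hwm, int npipes)
{
    return received >= hwm && received <= hwm_ceiling (hwm, npipes);
}
```

A test would then assert received_count_plausible (count, HWM, npipes) rather than comparing count against a single expected number, which tolerates scheduling differences between runs.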
@bluca , With these changes the test ran fine 200 times on my system while it failed after 20-30 runs previously.
I had opened #3253 but I'll rebase it instead. I think it's better to have a range, just to be sure. There's also test_hwm_pubsub that fails, even more rarely but still happens (took ~50 runs in valgrind to repro on my desktop):
Yeah, I guess that's a race condition due to the msleep() call: libzmq/tests/test_hwm_pubsub.cpp Line 148 in ea517a2
That raises a question: wherever I see msleep (SETTLE_TIME); in test code, I guess there's a possible race condition, isn't there?
The right solution is to synchronize, which is what is commonly done in production - that case looks like you can do the same as I did for the other test, switch to XPUB and don't start sending until you get the subscription message |
I think they are not race conditions in a strict sense, but timing dependencies.

There are cases where it is used as a timeout, assuming that if something does not happen within SETTLE_TIME, it will never happen. On slow machines, this may lead to false positives, in the sense that the "something" happens, but only after the timeout. Other cases expect that something happens, assuming that if it ever happens, it will within SETTLE_TIME. On slow machines, this leads to false negatives, i.e. a test case fails but the "something" would have happened after the timeout.

By increasing SETTLE_TIME, the probability of such false positives and negatives could be reduced, but not eliminated. However, increasing SETTLE_TIME will also increase the test execution time and therefore slow down feedback from the CI jobs.

In some cases, the test code is probably suboptimal, as sleeping for a fixed time could be replaced by an operation with a timeout (in the second case), and in others by explicitly waiting for the negated event (in the first case), the latter possibly requiring some extension of libzmq.
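To illustrate the "operation with timeout" alternative for the second case, here is a generic sketch that does not use libzmq at all. It polls a condition and returns as soon as it holds, so the timeout can be generous without slowing down the common case. The name wait_until and the 5 ms default poll interval are assumptions of this sketch, and it needs C++11.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: poll `condition` until it returns true or `timeout`
// elapses. Returns true as soon as the condition holds, false on timeout.
static bool wait_until (const std::function<bool ()> &condition,
                        std::chrono::milliseconds timeout,
                        std::chrono::milliseconds poll_interval =
                          std::chrono::milliseconds (5))
{
    const std::chrono::steady_clock::time_point deadline =
      std::chrono::steady_clock::now () + timeout;
    for (;;) {
        if (condition ())
            return true;
        if (std::chrono::steady_clock::now () >= deadline)
            return false;
        std::this_thread::sleep_for (poll_interval);
    }
}
```

Compared with a fixed msleep (SETTLE_TIME), a fast machine leaves the loop after a few milliseconds while a slow one gets the full timeout. This only helps where the test expects an event to happen; the "must never happen" case still needs a different approach, as noted above.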
Hi @sigiesec ,
This really sounds like a race condition (between zeromq background threads and the test main thread)... anyway, my intent is not to point out this aspect, just to understand. @bluca ,
Yeah, that looks like a good solution as well... I can try that in the next few days (probably not tomorrow)
Very last issue: I cannot see updated docs online at: Moreover there is some discrepancy: on GitHub I see version 4.2.5 available but the website lists version 4.2.3 as the latest one (and the latest stable docs are from 4.2.3 as well)...
Yes, the website is not updated automatically
Add new HWM tests and more detailed documentation
Problem: correct HWM handling is checked only on INPROC transport
Problem: the ZMQ proxy control socket is unusable when the proxy reaches its XPUB HWM and lossy flag is set.
Solution:
Note: this PR is still work-in-progress. In particular I need to think how to organize the new unit test on the proxy; currently it requires human inspection to understand what's going on.