Problem: UDP engine aborts on networking-related errors from socket syscalls #2862

ItsNayabSD · 2017-12-13T11:32:34Z

Description

I have a dish socket bound to a multicast address, and the dish socket receives messages in a while loop.
It works fine when the Ethernet interface is up.
I ran sudo ifconfig enp7s0 0.0.0.0 and observed the following:

No such device (src/udp_engine.cpp:142)
Aborted (core dumped)

I am using this dish socket in an application in which the IP address of the interface often becomes 0.0.0.0.
Is there any way I could exit the while loop safely without core dumping the entire application?

Environment

Version: Zeromq 4.2.1
OS: Ubuntu 16.04

bluca · 2017-12-13T18:38:14Z

I guess the UDP engine needs the same hardening the TCP one got a few months ago, against network errors. That's a setsockopt failing. PRs welcome.

ItsNayabSD · 2017-12-14T05:54:00Z

I am not much into C++. :(
And I see that the core dump happens even for an interface which has a static IP but is not connected to the network.

Is there any condition I could check to break out of the loop?
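
One condition that could be checked from the application side is whether the interface currently has a usable IPv4 address. The following is only a sketch under stated assumptions: interface_has_ipv4 is a hypothetical helper, not part of libzmq, and it relies on getifaddrs, which is available on Linux and the BSDs. It cannot fully prevent the abort, because the address can still change between the check and the next syscall made by the I/O thread.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <string>

// Hypothetical helper (not part of libzmq): returns true if the named
// interface is up and has at least one IPv4 address other than 0.0.0.0.
bool interface_has_ipv4 (const std::string &name)
{
    ifaddrs *list = nullptr;
    if (getifaddrs (&list) != 0)
        return false; // could not enumerate interfaces; treat as unusable

    bool found = false;
    for (const ifaddrs *it = list; it != nullptr; it = it->ifa_next) {
        // Entries without an address, or for other interfaces, are skipped.
        if (it->ifa_addr == nullptr || name != it->ifa_name)
            continue;
        if (it->ifa_addr->sa_family != AF_INET)
            continue;
        if ((it->ifa_flags & IFF_UP) == 0)
            continue;

        const sockaddr_in *sin =
          reinterpret_cast<const sockaddr_in *> (it->ifa_addr);
        if (sin->sin_addr.s_addr != htonl (INADDR_ANY)) {
            found = true;
            break;
        }
    }
    freeifaddrs (list);
    return found;
}
```

A receive loop could call interface_has_ipv4 ("enp7s0") before each iteration and leave the loop, or close the socket, when it returns false.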

ItsNayabSD · 2017-12-19T11:51:59Z

Hi,

I had to add

route add -net 224.0.0.0 netmask 240.0.0.0 eth0

so that the radio socket has a route for multicast traffic. Now the core dump no longer happens. :)

Thanks..

bluca · 2017-12-19T11:55:28Z

Great, happy you found a workaround and thanks for sharing it.

I'll reopen and retitle the issue, as the UDP implementation should be hardened anyway before it can be declared stable.

ItsNayabSD · 2018-01-05T18:20:09Z

I am able to reproduce the crash with v4.2.3 also. But the scenario is different.
This time GDB prints:

(gdb) bt
#0  0xb6bf8424 in __GI_raise (sig=sig@entry=6) at libpthread/nptl/sysdeps/unix/sysv/linux/raise.c:67
#1  0xb6bf27f0 in __GI_abort () at libc/stdlib/abort.c:89
#2  0xb6ed0e14 in zmq::zmq_abort (errmsg_=errmsg_@entry=0xb6c07210 <mylock> "") at src/err.cpp:87
#3  0xb6f07744 in zmq::udp_engine_t::out_event (this=<optimized out>) at src/udp_engine.cpp:285
#4  0xb6f06ca4 in zmq::udp_engine_t::restart_output (this=0x2061d0) at src/udp_engine.cpp:307
#5  0xb6eeea08 in zmq::session_base_t::read_activated (this=0x1fddd8, pipe_=0xb6edd454 <zmq::object_t::process_command(zmq::command_t&)+220>) at src/session_base.cpp:288
#6  0xb6ed1ea4 in zmq::io_thread_t::in_event (this=0x1fb2e8) at src/io_thread.cpp:85
#7  0xb6ed05e8 in zmq::epoll_t::loop (this=0x1fb808) at src/epoll.cpp:188
#8  0xb6f049a8 in thread_routine (arg_=0x1fb854) at src/thread.cpp:109
#9  0xb6f33b04 in start_thread (arg=0xb66f1520) at libpthread/nptl/pthread_create.c:297
#10 0xb6bf7b44 in clone () at libpthread/nptl/sysdeps/unix/sysv/linux/arm/../../../../../../../libc/sysdeps/linux/arm/clone.S:126
#11 0xb6bf7b44 in clone () at libpthread/nptl/sysdeps/unix/sysv/linux/arm/../../../../../../../libc/sysdeps/linux/arm/clone.S:126
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up 3
#3  0xb6f07744 in zmq::udp_engine_t::out_event (this=<optimized out>) at src/udp_engine.cpp:285
289	        errno_assert (rc != -1);
(gdb) p errno
Cannot find thread-local variables on this target
(gdb) p strerror(errno)
Cannot find thread-local variables on this target

The root cause of the crash is that there is no gateway entry. We had to add a dummy gateway entry to the route table.

simias · 2018-05-14T16:51:48Z

I'd like to look into that but I'm not sure how to report the error to the caller given that the out_event method returns void. I tried to look at the TCP implementation to figure out how it's handled there but the code seems so wildly different that I couldn't really figure out if anything was applicable for UDP.

What's the correct way of dealing with this error? I think returning an error in zmq_send would make the most sense but given the threading going on it's not obvious to me how that would be done. Maybe the error should be saved and returned on subsequent calls?
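
The "save the error and return it on a subsequent call" idea could look roughly like the sketch below. This is an illustration only, not libzmq code: deferred_error_t and send_checked are hypothetical names, and the sketch assumes a single atomic slot is enough to carry an errno value from the I/O thread to the application thread.

```cpp
#include <atomic>
#include <cerrno>

// Hypothetical illustration, not libzmq code: the I/O thread records the
// errno of a failed syscall, and the application thread picks it up on
// its next call.
class deferred_error_t
{
  public:
    // Called from the I/O thread when a syscall fails. Keeps the first
    // unreported error so that a later failure does not overwrite it.
    void record (int err)
    {
        int expected = 0;
        _pending.compare_exchange_strong (expected, err);
    }

    // Called from the application thread. Returns the pending errno and
    // clears it, or returns 0 if nothing failed since the last call.
    int take () { return _pending.exchange (0); }

  private:
    std::atomic<int> _pending{0};
};

// Hypothetical send wrapper: fails with the saved error, if any, before
// queueing the new message.
int send_checked (deferred_error_t &errors)
{
    const int err = errors.take ();
    if (err != 0) {
        errno = err;
        return -1;
    }
    // ... queue the message for the I/O thread here ...
    return 0;
}
```

With this shape the caller learns about a failure one call late, and only about the first failure since its previous call.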

bluca · 2018-05-14T16:54:12Z

With TCP, on recoverable/temporary errors the I/O thread engine simply tries again later. Can the UDP engine do that too?

simias · 2018-05-14T17:04:02Z

I'd have to look into that. In the case of UDP I wonder if it makes a lot of sense though, given the best effort nature of the protocol (especially in multicast). After all even if the kernel manages to send the packet you never have any guarantee that it'll ever reach its destination.

What happens if the messages pile up with TCP, I assume eventually they're simply dropped?

In my case the error returned by sendto is EADDRNOTAVAIL, and I don't know if it should really be considered recoverable or temporary. I think in TCP the error would be caught earlier during the connect call, which obviously has no counterpart for UDP.
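
One way to frame the question is to separate errno values that reflect the state of the network from those that indicate a bug, and to drop the datagram in the first case instead of aborting. The sketch below assumes an illustrative list of errno values; it is not the list libzmq uses, and is_network_error, send_result_t and send_datagram are hypothetical names.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Assumed, illustrative list of errno values that reflect network state
// rather than a bug in the caller. This is not the list libzmq uses.
bool is_network_error (int err)
{
    switch (err) {
        case EADDRNOTAVAIL:
        case ENETDOWN:
        case ENETUNREACH:
        case EHOSTUNREACH:
        case ENODEV:
            return true;
        default:
            return false;
    }
}

enum class send_result_t
{
    sent,    // the kernel accepted the datagram
    dropped, // network-related failure; the datagram is discarded
    fatal    // anything else; the caller decides whether to assert
};

// Hypothetical wrapper around sendto that never aborts by itself.
send_result_t send_datagram (int fd,
                             const void *data,
                             size_t size,
                             const sockaddr *addr,
                             socklen_t addr_len)
{
    const ssize_t rc = sendto (fd, data, size, 0, addr, addr_len);
    if (rc >= 0)
        return send_result_t::sent;
    if (is_network_error (errno))
        return send_result_t::dropped;
    return send_result_t::fatal;
}
```

Dropping on network errors matches the best-effort nature of UDP discussed above, while still leaving room to assert on errors such as EBADF that point to a programming mistake.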

bluca · 2018-05-14T17:07:08Z

With TCP I think the messages will fill the queue, and what happens depends on the HWM settings at that point.

simias · 2018-05-14T17:12:23Z

Now that I think about it even the calls to bind() and other syscalls in zmq::udp_engine_t::plug ought to report an error somehow instead of aborting. They seem less likely to fail "spuriously" but still.

bluca · 2018-05-14T17:13:51Z

The way to report status on the handshake and related statuses is via socket monitor events, if they happen in the I/O thread

simias · 2018-05-14T17:19:49Z

Ah yeah that would work, do you think it could be used to handle UDP send errors as well or would that be inappropriate?

bluca · 2018-05-14T17:24:11Z

IMHO that would be way too much traffic, and as you said UDP is unreliable by nature

ItsNayabSD · 2018-11-16T06:22:38Z

We upgraded the package and I can still see the crash when zmq_send fails. :(

simias · 2019-02-07T12:51:38Z

Yeah I hit that again. For the moment I still have an ugly hack where I comment out the assert in zmq::udp_engine_t::out_event to ignore the failure. Obviously it's not great...

I'd be interested in implementing a cleaner solution but I'm still unsure what to do. I tried taking inspiration from the TCP code but (as mentioned in previous comments) I don't really think it makes sense to retry when sendto fails. That being said, completely ignoring the error and not reporting it to the sender also seems like a poor idea.

ItsNayabSD closed this as completed Dec 19, 2017

bluca reopened this Dec 19, 2017

bluca changed the title ~~[Help] zeromq: Aborted (device not found). How to return safely when interface is down~~ Problem: UDP engine aborts on networking-related errors from socket syscalls Dec 19, 2017

bluca added the Area (Runtime / Usage), Starter Tasks, Symptom (Crash/Race/Undefined behavior), Transport (UDP), Useful Information labels Dec 19, 2017

bluca mentioned this issue Jan 5, 2018

[Help] Process aborted going infinite loop at src/fq.cpp:40 #2881

Closed

bluca modified the milestones: RADIO/DISH: declare as STABLE, UDP transport: declare as STABLE Mar 18, 2018

bluca mentioned this issue May 14, 2018

UDP engine out_event aborts when sendto fails #3102

Closed

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

4e2c85e

…eromq#2862

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

fba7c34

…eromq#2862

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

5db9915

…eromq#2862 (fix)

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

fe2287d

…eromq#2862 (clang-format)

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

c2f57a3

…eromq#2862 (remove unused function)

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

528c2e7

…eromq#2862 (format)

bluca pushed a commit that referenced this issue Aug 22, 2019

UDP engine aborts on networking-related errors from socket syscalls #…

f48c86d

…2862 (#3638) * UDP engine aborts on networking-related errors from socket syscalls #2862 * Add relicense statement

bluca closed this as completed Aug 22, 2019

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 24, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

f33c124

…eromq#2862

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 24, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

cb1f24d

…eromq#2862

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 24, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

4cf5eb0

…eromq#2862 (fix)

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 24, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

ba5b458

…eromq#2862 (fix)

atomashpolskiy added a commit to atomashpolskiy/libzmq that referenced this issue Aug 25, 2019

UDP engine aborts on networking-related errors from socket syscalls z…

214e045

…eromq#2862 (revert changes in error list in zmq::assert_success_or_recoverable)

bluca pushed a commit that referenced this issue Aug 25, 2019

UDP engine aborts on networking-related errors from socket syscalls (2)

2aa87c9

#2862 (#3640) * UDP engine aborts on networking-related errors from socket syscalls #2862
