
Non-blocking recv() isn't really non-blocking with lwip stack #13197

Closed
multiplemonomials opened this issue Jun 26, 2020 · 5 comments · Fixed by #13205

Comments

@multiplemonomials
Contributor

multiplemonomials commented Jun 26, 2020

Description of defect

In my project I need to implement a system with a high update rate (2 kHz) that communicates over Ethernet. So far sending works fine, but when I attempted receiving I ran into trouble: if there is currently no data in the receive buffer, UDPSocket::recvfrom() seems to block for at least 500 us!

I investigated the problem some more and I think I see what's going on. It looks like LWIPStack always creates LWIP sockets in blocking mode with a timeout of 1ms, regardless of what the Socket's blocking mode is set to. So, recv() will always wait about a millisecond for data to arrive before returning.

It seems like the best fix for this would be for LWIPStack to correctly set non-blocking sockets to be non-blocking in LWIP. Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout. It looks like this could be avoided pretty easily by proxying the actual timeout value to lwip, so it will then mutex wait for the correct amount of time before returning.
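As a rough illustration of that fix direction (a sketch only; the helper name and parameters here are hypothetical and not existing Mbed OS or lwIP glue code), the socket-open path could map the nsapi-level settings onto the netconn like this:

    #include "lwip/api.h"

    // Hypothetical helper: apply the nsapi-level blocking mode / timeout to a
    // freshly created lwIP netconn instead of hard-coding a 1 ms recv timeout.
    void apply_nsapi_mode(struct netconn *conn, bool blocking, uint32_t timeout_ms)
    {
        if (!blocking) {
            // Non-blocking nsapi socket -> genuinely non-blocking lwIP conn,
            // so netconn_recv() returns ERR_WOULDBLOCK immediately.
            netconn_set_nonblocking(conn, true);
        } else {
            // Blocking nsapi socket -> let lwIP itself wait for the real
            // timeout (0 means wait forever in lwIP), instead of polling
            // with a 1 ms receive timeout.
            netconn_set_recvtimeout(conn, timeout_ms);
        }
    }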

Patch:

There are probably a lot of changes needed to properly handle non-blocking sockets, so I came up with a simple (but dirty) patch:
In sys_arch_mbox_fetch() in lwip_sys_arch.c, I changed:

    uint32_t flags = osEventFlagsWait(mbox->id, SYS_MBOX_FETCH_EVENT,
            osFlagsWaitAny | osFlagsNoClear, (timeout ? timeout : osWaitForever));

to

    uint32_t osTimeout;
    if(timeout == 0)
    {
        // infinite wait
        osTimeout = osWaitForever;
    }
    else if(timeout == 1)
    {
        // PATCH: treat 1ms timeout as 0
        osTimeout = 0;
    }
    else
    {
        osTimeout = timeout;
    }

    uint32_t flags = osEventFlagsWait(mbox->id, SYS_MBOX_FETCH_EVENT,
            osFlagsWaitAny | osFlagsNoClear, osTimeout);

Since LWIPInterface uses a timeout of 1 everywhere, this changes it to a timeout of 0, so the wait returns immediately if there is no message. The change has to be made here because, in the original code, a timeout of 0 would be converted to osWaitForever.

In my testing this brings the recv() delay down to where it should be, about 10 us. It does prevent blocking sockets from working properly, though, so keep that in mind. I also haven't tested it with TCP.

Target(s) affected by this defect ?

I was testing on a NUCLEO_F429ZI, but I think that this affects all targets.

Toolchain(s) (name and version) displaying this defect ?

GCC_ARM 2019 q4

What version of Mbed-os are you using (tag or sha) ?

mbed-os-5.15.0

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

I'm using my custom build system, https://github.com/USCRPL/mbed-cmake

This integrates with the Mbed Python scripts and should have the same build behavior as mbed-cli.

How is this defect reproduced ?

  1. Create a UDP socket in non-blocking mode. Don't send any data to it.
  2. Call socket.recvfrom().

Expected behavior: with no data in the socket, I expect recvfrom() to take no more than a few tens of microseconds to run.

Actual behavior: recvfrom() takes about 480-500 us to execute.
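
A minimal sketch for reproducing the measurement (the network bring-up, port number, and buffer size are assumptions; the point is only the timing of a non-blocking recvfrom() with nothing queued):

    #include "mbed.h"
    #include "EthernetInterface.h"

    int main()
    {
        EthernetInterface net;
        net.connect();                      // DHCP; assumes a wired Ethernet target

        UDPSocket socket;
        socket.open(&net);
        socket.bind(12345);                 // arbitrary local port, nothing is ever sent to it
        socket.set_blocking(false);         // non-blocking mode

        char buf[64];
        SocketAddress source;

        Timer t;
        t.start();
        nsapi_size_or_error_t ret = socket.recvfrom(&source, buf, sizeof(buf));
        t.stop();

        // Expected: ret == NSAPI_ERROR_WOULD_BLOCK after a few tens of microseconds.
        // Observed (mbed-os-5.15.0, NUCLEO_F429ZI): roughly 480-500 us.
        printf("recvfrom() returned %d after %d us\r\n", ret, t.read_us());
    }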

@ciarmcom
Member

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2743

@0xc0170
Contributor

0xc0170 commented Jun 29, 2020

cc @ARMmbed/mbed-os-ipcore

@kjbracey
Contributor

kjbracey commented Jun 29, 2020

Thanks for identifying this.

I was unaware of this - I was convinced we were using non-blocking calls to lwIP.

Indeed, TCP sockets are set to non-blocking.

This is related to #13056, where there was confusion because I thought that sockets were created non-blocking, and that the lines in connect were temporarily setting the socket to blocking and then putting it back.

When I realised that wasn't the case, and connect was setting it non-blocking for the first time, alarm bells should have gone off.

If connect is setting TCP sockets non-blocking for the first time, then who is setting UDP sockets non-blocking? No-one, apparently. Eep!

Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout.

Not quite following this comment - all LWIPStack calls are supposed to always be non-blocking, regardless of the nsapi-level blocking/timeout setting, and InternetDatagramSocket::recvfrom will wait for a call to the socket_attach callback before retrying, if it wants to block.
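
To make that split concrete, here is a rough sketch of the pattern being described (not the actual Mbed OS source; the helper names and flag handling are illustrative only): the stack-level call always returns immediately, and blocking is implemented above it by waiting on an event flag that the attach/sigio callback sets.

    #include "mbed.h"

    // Flag set from the stack's socket_attach/sigio callback whenever the
    // socket has activity (data arrived, connection closed, ...).
    static rtos::EventFlags socket_event;
    static const uint32_t SOCKET_EVENT_FLAG = 0x1;

    void on_socket_event()
    {
        socket_event.set(SOCKET_EVENT_FLAG);
    }

    // Stand-in for the stack-level call, which always returns immediately;
    // here it simply reports "no data yet".
    nsapi_size_or_error_t stack_recv_nonblocking(void *buf, nsapi_size_t size)
    {
        (void)buf;
        (void)size;
        return NSAPI_ERROR_WOULD_BLOCK;
    }

    // Blocking behaviour lives above the stack: retry the non-blocking call
    // each time the event flag is signalled, until data arrives or the
    // caller's timeout expires. (A real implementation would also track the
    // remaining time across retries.)
    nsapi_size_or_error_t blocking_recv(void *buf, nsapi_size_t size, uint32_t timeout_ms)
    {
        while (true) {
            nsapi_size_or_error_t ret = stack_recv_nonblocking(buf, size);
            if (ret != NSAPI_ERROR_WOULD_BLOCK) {
                return ret;
            }
            uint32_t flags = socket_event.wait_any(SOCKET_EVENT_FLAG, timeout_ms);
            if (flags & osFlagsError) {
                return NSAPI_ERROR_WOULD_BLOCK;   // timed out waiting for data
            }
        }
    }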

I would hope that just replacing that netconn_set_recvtimeout(s->conn, 1) with netconn_set_nonblocking(s->conn, true) should work. TCP sockets are operating in that mode already.

(I think we could then turn off LWIP_SO_RCVTIMEO to save a tiny bit of ROM and RAM).
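
Concretely, that replacement would look something like this where the netconn for a new socket is configured (the surrounding function is hypothetical; only the two netconn_* calls come from the comment above):

    #include "lwip/api.h"

    void configure_new_conn(struct netconn *conn)
    {
        // Before (per this issue): the conn stays blocking with a 1 ms receive
        // timeout, so every recv() can stall for about a millisecond.
        // netconn_set_recvtimeout(conn, 1);

        // After: the conn is genuinely non-blocking; netconn_recv() returns
        // ERR_WOULDBLOCK immediately when nothing is queued, and any blocking
        // or timeout behaviour is handled at the nsapi layer instead.
        netconn_set_nonblocking(conn, true);
    }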

@multiplemonomials
Contributor Author

multiplemonomials commented Jun 29, 2020

Alright I tried that change and everything seems to be working. Thanks!

Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout.

Scratch that, it was just a misunderstanding on my part of how that part of the socket code works, plus an issue with my test code that was causing bad performance. Everything seems to be fine with blocking sockets.

@0xc0170
Contributor

0xc0170 commented Jun 30, 2020

@multiplemonomials Thanks for the detailed report, and @kjbracey-arm for the instant fix.
