
Non-blocking recv() isn't really non-blocking with lwip stack #13197

Closed
multiplemonomials opened this issue Jun 26, 2020 · 5 comments · Fixed by #13205

Comments

@multiplemonomials
Contributor

multiplemonomials commented Jun 26, 2020

Description of defect

In my project I need to implement a system with a high update rate (2 kHz) that communicates over Ethernet. So far sending works fine, but when I attempted receiving I ran into trouble: if there is currently no data in the receive buffer, UDPSocket::recvfrom() seems to block for at least 500 us!

I investigated the problem some more and I think I see what's going on. It looks like LWIPStack always creates LWIP sockets in blocking mode with a timeout of 1ms, regardless of what the Socket's blocking mode is set to. So, recv() will always wait about a millisecond for data to arrive before returning.

It seems like the best fix for this would be for LWIPStack to correctly set non-blocking sockets to be non-blocking in LWIP. Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout. It looks like this could be avoided pretty easily by proxying the actual timeout value to lwip, so it will then mutex wait for the correct amount of time before returning.
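As a rough illustration of that fix direction (a sketch only; the helper name and parameters here are hypothetical and not existing Mbed OS or lwIP glue code), the socket-open path could map the nsapi-level settings onto the netconn like this:

    #include "lwip/api.h"

    // Hypothetical helper: apply the nsapi-level blocking mode / timeout to a
    // freshly created lwIP netconn instead of hard-coding a 1 ms recv timeout.
    void apply_nsapi_mode(struct netconn *conn, bool blocking, uint32_t timeout_ms)
    {
        if (!blocking) {
            // Non-blocking nsapi socket -> genuinely non-blocking lwIP conn,
            // so netconn_recv() returns ERR_WOULDBLOCK immediately.
            netconn_set_nonblocking(conn, true);
        } else {
            // Blocking nsapi socket -> let lwIP itself wait for the real
            // timeout (0 means wait forever in lwIP), instead of polling
            // with a 1 ms receive timeout.
            netconn_set_recvtimeout(conn, timeout_ms);
        }
    }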

Patch:

There are probably a lot of changes needed to properly handle non-blocking sockets, so I came up with a simple (but dirty) patch:
In sys_arch_mbox_fetch() in lwip_sys_arch.c, I changed:

    uint32_t flags = osEventFlagsWait(mbox->id, SYS_MBOX_FETCH_EVENT,
            osFlagsWaitAny | osFlagsNoClear, (timeout ? timeout : osWaitForever));

to

    uint32_t osTimeout;
    if(timeout == 0)
    {
        // infinite wait
        osTimeout = osWaitForever;
    }
    else if(timeout == 1)
    {
        // PATCH: treat 1ms timeout as 0
        osTimeout = 0;
    }
    else
    {
        osTimeout = timeout;
    }

    uint32_t flags = osEventFlagsWait(mbox->id, SYS_MBOX_FETCH_EVENT,
            osFlagsWaitAny | osFlagsNoClear, osTimeout);

Since LWIPInterface uses a timeout of 1 everywhere, this changes it to a timeout of 0, so the wait returns immediately if there is no message. The change has to be made here because, in the original code, a timeout of 0 would be converted to osWaitForever.

In my testing this brings the recv() delay down to where it should be, about 10 us. It does prevent blocking sockets from working properly, though, so keep that in mind. I also haven't tested it with TCP.

Target(s) affected by this defect ?

I was testing on a NUCLEO_F429ZI, but I think that this affects all targets.

Toolchain(s) (name and version) displaying this defect ?

GCC_ARM 2019 q4

What version of Mbed-os are you using (tag or sha) ?

mbed-os-5.15.0

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

I'm using my custom build system, https://github.com/USCRPL/mbed-cmake

This integrates with the Mbed Python scripts and should have the same build behavior as mbed-cli.

How is this defect reproduced ?

  1. Create a UDP socket in non-blocking mode. Don't send any data to it.
  2. Call socket.recvfrom().

Expected behavior: with no data in the socket, I expect recvfrom() to take no more than a few tens of microseconds to run.

Actual behavior: recvfrom() takes about 480-500 us to execute.
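
A minimal sketch for reproducing the measurement (the network bring-up, port number, and buffer size are assumptions; the point is only the timing of a non-blocking recvfrom() with nothing queued):

    #include "mbed.h"
    #include "EthernetInterface.h"

    int main()
    {
        EthernetInterface net;
        net.connect();                      // DHCP; assumes a wired Ethernet target

        UDPSocket socket;
        socket.open(&net);
        socket.bind(12345);                 // arbitrary local port, nothing is ever sent to it
        socket.set_blocking(false);         // non-blocking mode

        char buf[64];
        SocketAddress source;

        Timer t;
        t.start();
        nsapi_size_or_error_t ret = socket.recvfrom(&source, buf, sizeof(buf));
        t.stop();

        // Expected: ret == NSAPI_ERROR_WOULD_BLOCK after a few tens of microseconds.
        // Observed (mbed-os-5.15.0, NUCLEO_F429ZI): roughly 480-500 us.
        printf("recvfrom() returned %d after %d us\r\n", ret, t.read_us());
    }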

@ciarmcom
Member

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2743

@0xc0170
Contributor

0xc0170 commented Jun 29, 2020

cc @ARMmbed/mbed-os-ipcore

@kjbracey
Contributor

kjbracey commented Jun 29, 2020

Thanks for identifying this.

I was unaware of this - I was convinced we were using non-blocking calls to lwIP.

Indeed, TCP sockets are set to non-blocking.

This is related to #13056, where there was confusion because I thought that sockets were created non-blocking, and that the lines in connect were temporarily setting the socket to blocking and then putting it back.

When I realised that wasn't the case, and connect was setting it non-blocking for the first time, alarm bells should have gone off.

If connect is setting TCP sockets non-blocking for the first time, then who is setting UDP sockets non-blocking? No-one, apparently. Eep!

Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout.

Not quite following this comment - all LWIPStack calls are supposed to always be non-blocking, regardless of the nsapi-level blocking/timeout setting, and InternetDatagramSocket::recvfrom will wait for a call to the socket_attach callback before retrying, if it wants to block.
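
To make that split concrete, here is a rough sketch of the pattern being described (not the actual Mbed OS source; the helper names and flag handling are illustrative only): the stack-level call always returns immediately, and blocking is implemented above it by waiting on an event flag that the attach/sigio callback sets.

    #include "mbed.h"

    // Flag set from the stack's socket_attach/sigio callback whenever the
    // socket has activity (data arrived, connection closed, ...).
    static rtos::EventFlags socket_event;
    static const uint32_t SOCKET_EVENT_FLAG = 0x1;

    void on_socket_event()
    {
        socket_event.set(SOCKET_EVENT_FLAG);
    }

    // Stand-in for the stack-level call, which always returns immediately;
    // here it simply reports "no data yet".
    nsapi_size_or_error_t stack_recv_nonblocking(void *buf, nsapi_size_t size)
    {
        (void)buf;
        (void)size;
        return NSAPI_ERROR_WOULD_BLOCK;
    }

    // Blocking behaviour lives above the stack: retry the non-blocking call
    // each time the event flag is signalled, until data arrives or the
    // caller's timeout expires. (A real implementation would also track the
    // remaining time across retries.)
    nsapi_size_or_error_t blocking_recv(void *buf, nsapi_size_t size, uint32_t timeout_ms)
    {
        while (true) {
            nsapi_size_or_error_t ret = stack_recv_nonblocking(buf, size);
            if (ret != NSAPI_ERROR_WOULD_BLOCK) {
                return ret;
            }
            uint32_t flags = socket_event.wait_any(SOCKET_EVENT_FLAG, timeout_ms);
            if (flags & osFlagsError) {
                return NSAPI_ERROR_WOULD_BLOCK;   // timed out waiting for data
            }
        }
    }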

I would hope that just replacing that netconn_set_recvtimeout(s->conn, 1) with netconn_set_nonblocking(s->conn, true) should work. TCP sockets are operating in that mode already.

(I think we could then turn off LWIP_SO_RCVTIMEO to save a tiny bit of ROM and RAM).
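
Concretely, that replacement would look something like this where the netconn for a new socket is configured (the surrounding function is hypothetical; only the two netconn_* calls come from the comment above):

    #include "lwip/api.h"

    void configure_new_conn(struct netconn *conn)
    {
        // Before (per this issue): the conn stays blocking with a 1 ms receive
        // timeout, so every recv() can stall for about a millisecond.
        // netconn_set_recvtimeout(conn, 1);

        // After: the conn is genuinely non-blocking; netconn_recv() returns
        // ERR_WOULDBLOCK immediately when nothing is queued, and any blocking
        // or timeout behaviour is handled at the nsapi layer instead.
        netconn_set_nonblocking(conn, true);
    }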

@multiplemonomials
Contributor Author

multiplemonomials commented Jun 29, 2020

Alright I tried that change and everything seems to be working. Thanks!

Another, related issue is that blocking sockets still only use a timeout of 1, so they turn into a resource-intensive spinlock if recv() is called with a longer timeout.

Scratch that, it was just a misunderstanding on my part of how that part of the socket code works, plus an issue with my test code that was causing bad performance. Everything seems to be fine with blocking sockets.

@0xc0170
Contributor

0xc0170 commented Jun 30, 2020

@multiplemonomials Thanks for the detailed report, and @kjbracey-arm for the instant fix.
