Skip to content

Socket connect() function has no timeout? #13056

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
star297 opened this issue Jun 2, 2020 · 16 comments
Closed

Socket connect() function has no timeout? #13056

star297 opened this issue Jun 2, 2020 · 16 comments

Comments

@star297
Copy link
Contributor

star297 commented Jun 2, 2020

Description of defect

Huge problem with this, I have several remote devices connected on my local network. Some may be 'off-line' at times so I need the connect() function to back out within a few milliseconds. However whatever I try the function locks up for 30 seconds if the remote device IP address has dropped out.
When the remote device is active, the connection time is in the order of 2-3 milliseconds, I would be aiming for a 10 millisecond timeout.
I'm using TCPsocket, however this probably affects other sockets as well.
The 30 seconds is constant so perhaps there's a global value somewhere that I could 'tweak' for a temporary fix?

I think its a similar issue as mentioned here, however it needs to time out before the possibility of starting a new connect():
#https://forums.mbed.com/t/socket-connect-implementation-in-mbed-os/8055

This example below can demonstrate it, however I think its academic as the time out function is not employed with the connect() function (unless I've missed something).

    Timer t1;
    TCPSocket remote_device;
    SocketAddress remote_addr("192.168.1.150",80);
    remote_device.open(net);        

    remote_device.set_timeout(10);  // set 10 mS timeout
    
    t1.start();  
    remote_device.connect(remote_addr); 
    t1.stop();
    printf ("connect time: %f\n", t1.read());

Target(s) affected by this defect ?

Any

Toolchain(s) (name and version) displaying this defect ?

OS 5.15
OS 6 alpha-3

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

Mbed-online.
Studio-online

@ciarmcom
Copy link
Member

ciarmcom commented Jun 2, 2020

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2692

@kjbracey
Copy link
Contributor

kjbracey commented Jun 3, 2020

Are you using lwIP? I suspect so. The lwIP implementation has this behaviour as a backwards-compatibility fudge. There was application code that relied on TCPSocket::connect being blocking, even when the socket was set to non-blocking.

You should be able to get it to obey timeouts by simply removing the pair of netconn_set_nonblocking calls here:

https://github.com/ARMmbed/mbed-os/blob/master/features/lwipstack/LWIPStack.cpp#L353

That will make the underlying lwIP implementation of connect be non-blocking (like all the other socket calls), which lets TCPSocket::connect implement the timeout functionality.

There should probably be a config option to allow you to remove that behaviour.

If you're using an off-board stack, like an ESP8266 module, it's not terribly uncommon for them to have limited nonblocking support on things like connect. They might be harder to fix.

@kjbracey
Copy link
Contributor

kjbracey commented Jun 3, 2020

Note that non-blocking or timing-out connects are a tricky thing. If the Mbed OS connect call returns after your specified time, that doesn't necessarily mean the stack has given up - the connect attempt is in progress and will keep going. The timeout is purely on "how long should this call wait for completion before handing back control", not "before giving up".

Another call to connect will check the status of the already-ongoing connection, and let you know if it's still going, succeeded, or given up. That attempt will probably last the full 30 seconds or more regardless.

If you want to try a new connection attempt, you must close and re-open the socket before you do so. You can't make two different connection attempts on one socket.

None of this behaviour is unique to Mbed OS - it's standard POSIX/BSD TCP socket behaviour, respelt into a C++ API. You should be able to find assorted sources.

@star297
Copy link
Contributor Author

star297 commented Jun 4, 2020

Thanks Kevin, I'm using Mbed OS-6 with offline Studio so I assume that would be LWIP.
Using Nucleo-F767ZI on board Ethernet.

That https://github.com/ARMmbed/mbed-os/blob/master/features/lwipstack/LWIPStack.cpp#L353
didn't change anything for me.

I have tried a few fixes:

https://github.com/ARMmbed/mbed-os/blob/master/features/lwipstack/lwip/src/api/lwip_api_msg.c#L1394

setting that to various values 50ms to 5000ms does actually work okay, but throws other errors like:

++ MbedOS Error Info ++
Error Status: 0x80010134 Code: 308 Module: 1
Error Message: Semaphore: 0x834FF0E9, Parameter error
Location: 0x804A2F9
Error Value: 0x834FF0E9
Current Thread: lwip_tcpip Id: 0x2000BF64 Entry: 0x8049EC9 StackSize: 0x4B0 StackMem: 0x2000D870 SP: 0x2007FF58

Using the ESP8266 does not have that 30 second socket connect delay although the Mbed driver is a bit sluggish and you have to wait a while before another connect attempt after null IP address.

Could there be an override available similar here for "tcp-connect-timeout":

https://github.com/ARMmbed/mbed-os/blob/master/features/lwipstack/mbed_lib.json#L113

Does anyone know where that 30 second blocking/time-out is value is defined?
I could try different settings there.

@kjbracey
Copy link
Contributor

kjbracey commented Jun 5, 2020

Using Nucleo-F767ZI on board Ethernet.

That would then be lwIP for IPv4. (You could use Nanostack instead for IPv6, but it doesn't support IPv4. Good chance its connect works better though. That would mean flipping nsapi.default-stack.)

https://github.com/ARMmbed/mbed-os/blob/master/features/lwipstack/lwip/src/api/lwip_api_msg.c#L1394

The nonblocking change I suggested should have stopped it going through that code. It should be taking the if (non_blocking) path instead and returning ERR_INPROGRESS. Did it reach there after that change?

I believe sticking a timeout there will break it because the connect will still be trying to report back to someone who has given up waiting. It's signalling a destroyed semaphore.

I don't see an explicit 30 seconds, but there is a config setting for how many retry attempts to make on connect that could lower the total attempt time: lwip.tcp-synmaxrtx.

although the Mbed driver is a bit sluggish and you have to wait a while before another connect attempt after null IP address.

(Not having checked the code...) It won't be the driver being sluggish, it will be us having to wait for the device to finish the last connect attempt. By setting the timeout we're giving up waiting for a response after 10ms, but not changing the ESP8266s own retry time (not sure if you can on the ESP8266). In lwIP you should be able to do a close after giving up waiting to make it abort, but on the ESP8266 it probably doesn't respond to any requests from us until its given up on the connect itself.

@adbridge
Copy link
Contributor

@star297 please do not remove sections from the issue template! All fields are required to be filled in.

@star297
Copy link
Contributor Author

star297 commented Jun 12, 2020

@adbridge Will do, but I think this can be closed as an 'issue' in any case and added somewhere as an 'enhancement' if and when someone who knows this library has time to add a socket connect timeout. But a fixed 30 seconds timeout is difficult to work with. I did have a look but the code is vast.

However I have attacked this from a different angle where the project can wait 30 seconds on a null IP connect attempt and periodically attempt a re-connection but is far from ideal.
Otherwise the IOT functions on Mbed are excellent, Thank you to all concerned !!

@kjbracey
Copy link
Contributor

But a fixed 30 seconds timeout is difficult to work with.

Did my suggestion of adjusting lwip.tcp-synmaxrtx not let you lower that?

@star297
Copy link
Contributor Author

star297 commented Jun 15, 2020

Yes it does Kevin 👍
I wasn't sure how to implement it at first and I've been stuck on a problem I found with server multiple connections since using non-string based API's.

I've added this to my mbed_app.json file

"lwip.tcp-synmaxrtx": 1,

gives a stable timeout of 5 seconds. Apparently the range is 1 to 12 with, I assume, 5 second multiples (I tried setting 2 which results in a 10 second timeout).
I've been testing it now for a while, completely stable.
I would like to get that down to 1 or 2 seconds if possible, the ESP8266 appears to have a Mbed default of 2 second timeout, "lwip.tcp-synmaxrtx" has no effect on ESP8266's own stack as you mentioned previously.
Could I suggest to set that multiplier to 2 rather than 5? this would then be more cross device compatible and 5 seconds is still a bit of wait tbh.
If you know where that setting is I can change and test it here.

@kjbracey
Copy link
Contributor

I think it might be the 3000 hardcoded in the pcb->rto setup in lwip_tcp.c. This still isn't my ideal route though - my original suggestion really should have worked.

Changing that will speed it up, but will make the whole TCP implementation a bit more aggressive.

@star297
Copy link
Contributor Author

star297 commented Jun 15, 2020

Tried again, just to be sure, but no change.

Using Mbed OS version 5.15.3 on Mbed Studio.

Commented out those non_blocking calls in LWIPStack.cpp

//netconn_set_nonblocking(s->conn, false);
err_t err = netconn_connect(s->conn, &ip_addr, address.get_port());
//netconn_set_nonblocking(s->conn, true);

This is a snip of the test program I'm using here.
https://os.mbed.com/users/star297/code/TCP-client-test/

    TCPSocket remote_device;
    SocketAddress remote_addr("192.168.1.150",80);
    remote_device.open(net); 
       
    //this has no efect
    remote_device.set_timeout(2000);  
        
    led=1;
    printf ("connecting to Server..\n");
    t1.reset();t1.start();  
    remote_device.connect(remote_addr); 
    t1.stop();
    printf ("connect time: %f\n", t1.read()); 
    led=0; 

Changing that 3000 to 1000 as you suggested...

3000 hardcoded in the pcb->rto setup in lwip_tcp.c.

Timeout changed to around 1.5 seconds but then gradually increased back to 5 seconds.
Otherwise stable and no hard faults.

@kjbracey
Copy link
Contributor

Okay, I think I was mistaken about that nonblocking snippet.

I thought it was temporarily setting a nonblocking socket to blocking then putting it back. Not unreasonable, the way it looks.

In fact, I think it wasn't nonblocking to start with. That is the point at which it is set nonblocking for the first time.

Change the code to

netconn_set_nonblocking(s->conn, true);
err_t err = netconn_connect(s->conn, &ip_addr, address.get_port());

@star297
Copy link
Contributor Author

star297 commented Jun 16, 2020

Works perfect!!
Reverted all previous changes and in LWIPStack.cpp applied,

netconn_set_nonblocking(s->conn, true);
err_t err = netconn_connect(s->conn, &ip_addr, address.get_port());

timeout() now functions with socket connect().
I'm testing using 500ms timeout with a data packet of 64 bytes.
Resulting processing time on a successful connection is less than 5ms with a 500ms duration timeout if the IP drops (unsuccessful connection).
Periodically taking the IP off line and back on again works as expected and no hard fault at any time.
Thank you Kevin!

Could this change be applied to the library please?

@kjbracey
Copy link
Contributor

You're welcome to make a patch yourself. Main concern will be backwards compatibility, as I stated before. I think it may have to be behind a config option, leaving default behaviour as it is now :/

I regard this change as a bug fix, but connect not obeying timeouts in lwIP is such long-standing behaviour people will be relying on it.

@star297
Copy link
Contributor Author

star297 commented Jun 16, 2020

I'm thinking it's a relatively easy change to make locally should anyone need to.
With the original Mbed-online migrating to Studio-online, users will have the capability for minor local library changes similar to the depreciated Mbed-DEV library.
So lets leave it for now to save over complicating the official Mbed library unless others request it.

I'll put something on the forum in case any one else has this requirement.

Would you close this issue off unless you need to add anything else.

@0xc0170
Copy link
Contributor

0xc0170 commented Jun 17, 2020

Closing as requested. Thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants