Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide a portable way to clear tcp pcb in time-wait #4213

Closed
d-a-v opened this issue Jan 22, 2018 · 10 comments · Fixed by #4314
Closed

provide a portable way to clear tcp pcb in time-wait #4213

d-a-v opened this issue Jan 22, 2018 · 10 comments · Fixed by #4314
Assignees

Comments

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 22, 2018

A common API for both lwip-v1 / lwip-v2 is needed to clear PCBs in TIME-WAIT since this workaround is widely used (tcp_abort(tcp_tw_pcbs))

@d-a-v d-a-v self-assigned this Jan 22, 2018
@d-a-v d-a-v added this to the 2.4.1 milestone Jan 22, 2018
@mribble
Copy link
Contributor

mribble commented Jan 22, 2018

I've been reading through lots of esp8266 bugs related to this and I haven't found anything that explains exactly what is happening so I went and did some research on this topic. I'm sure a lot of people who work on esp8266 understand this, but I want to write down a longer description of the problem for people who aren't experts on tcp protocols (like myself). Experts, if I get anything wrong here please correct me.

This 184 bytes of memory isn't a leak. It is a tracked tcp connection, and is held around because that is what the TCP spec says you should do. The reason it is held around is because creating a tcp connection is expensive (requires a 3 way handshake) so it was assumed clients would want to reuse a connection as a performance optimization. The TCP spec calls this the WAIT_STATE and it's the last state of of a TCP connection where it's just sitting there seeing if the connection will be reused. Here is an article that explains WAIT_STATE in more detail: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

So you might be wondering what does tcp have to do with anything? I'm just seeing this problem when loading webpages on my esp8266. Here are two common places where I think you might run out of memory:

  1. When a lot of different devices connect to the esp8266
  2. When sending a lot of new xmlHttpRequest calls from a single device (like you would with an auto refreshing webpage or when using ajax)

Loading webpages and ajax use http and http is layered on top of tcp. The http protocol have this reuse concept too and they call it "HTTP persistent connection". https://en.wikipedia.org/wiki/HTTP_persistent_connection

I think it's possible to fix the number 2 issue above. My problem is that I'm using ajax and constantly creating new xmlHttpRequest objects for each data transfer. I am planning to try to modify how I use xmlHttpRequests so instead of always creating a new one I will now either reuse one or maybe make a pool of them that I can reuse. This should both improve performance, reduce power usage, and get rid of the out of memory problem for me. If you are doing a full html page refresh you won't have an option to reuse the xmlHttpRequest, but if you spend some time on the webpage you could change it to use ajax to do a full refresh of the body of your page and that would allow you to fix the multiple xmlHttpRequest issue.

I don't think number 1 above can be fixed on the client since each connection is coming from a different device and thus they can't reuse xmlHttpRequest connections. While it might not be common for an esp8266 to host a website that is accessed by lots of devices, I do think it should have a better behavior than consuming all the memory until some other function that requires a few KB of heap space crashes.

I think a final fix for this issue should involve 2 new public APIs:

  1. A way to specify how many seconds a tcp connection stays in the WAIT_STATE before closing it
  2. A way to specify how many tcp connections can remain active before the least recently used is aborted

If there is objection to having both these new APIs then I think the second one is more important because it lets you control the amount of memory consumed by this tcp wait_state algorithm. In microcontroller environments like the esp8266 memory usage is often more important than performance. Some people might want to set it to 0 to keep memory consumption to a minimum. Most people would just use whatever the default is. Once I make the modifications above, I might choose some low number that matches well with however many my client's xmlHttpRequest pool uses (perhaps 8).

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 22, 2018

I should not have written API. I intended to propose an example sketch on how to release those pcb in time-wait state for the two lwip versions. lwip-v1 did not handle them properly and workaround such as the one in your #1923 comment is widely used. Since lwip2, the include files have changed and there are several questions about how to do this.

The fact are that it should not be necessary (but lwip-v1 timers are buggy) and this test shows it with lwip2.

In the example I would provide this explanation, maybe a link to this test, and leave the possibility for the user to keep using it with lwip2.

So since you are facing this problem, could you check whether this workaround is still necessary ?

@mribble
Copy link
Contributor

mribble commented Jan 22, 2018

Thanks for getting back to me d-a-v!

Are you suggesting I try the app in that other thread with the version of lwip-2 in core esp8266 2.4.0? I have verified my app shows the problem and I'm pretty sure the app in that other thread just isn't doing enough tcp connections to show the issue I'm seeing.

My app does a new http connection every 500 ms to keep the ajax responsive. Each request consumes 168 bytes. In 60 seconds that's about 20KB which gets me to a problem area. If I let the esp8266 sit there for another minute or so I see it starts freeing memory. It is freeing this memory too slowly for my app. This agrees with your post in the other thread where you said: "so about ~2min after burst requests, memory is released back by lwip timers." 2 minutes is just too long for some apps.

I do think I can reuse http requests in my app and I'll look into that, but after reading through the esp8266 bugs I can see other people are using ajax like I am and you will probably be getting more of these bug reports. Is there a way for the timers to free memory more quickly? That said, this is going the direction of my solution number 1 above and I really think having a way to limit total memory consumption is what most people used to embedded programming would prefer. Many embedded applications will do all their memory allocation during init and thus need to know hard limits on memory usage. I'm not that fussy about this, but in this case a limited number of connections seems better than a faster way to free old connections.

@mribble
Copy link
Contributor

mribble commented Jan 22, 2018

Two other issues I've noticed with my app with lwip 2:

  1. The values in WiFi.getMode() or WiFi.status() changed. I'm not complaining really because the old values were not what I expected. Just mentioning this could break some apps (it broke my connection indicator LED, but it's really easy for me to fix in the app). No fix needed, just something that you should document as something people need to change when porting to lwip 2.
  2. OTA updates (specifically the advertising) don't seem to work with lwip 2. If this isn't a common issue I can try looking into it more.

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 23, 2018

Maximum time-wait is a defined set to 2mn by default in lwipopts (2*TCP_MSL). Grain is 500ms.

#define TCP_MSL 60000UL /* The maximum segment lifetime in milliseconds */

Reducing it is not advised for heavily loaded web servers, or when dealing with NAT peers. Time-wait helps a tcp connection not confusing two consecutive connections with the same (s-ip,s-port,d-ip,d-port) when the first is closed but duplicate packets lost in internet arrive later during the second.

The workaround is an artificial way to lower it. I would lean to lower it, but it will never be low enough for some, and it would break TCP as it is defined. So the best move here is to show it in an example.

I was not suggesting you try another sketch than yours.

Please open new issues regarding 1. WiFi.getMode() with an example so we can quickly test / verify.
Regarding 2. I believe there has been OTA fixes since 2.4.0, I lost track of them and I let you retry with latest git version of the core. Then browse the opened issues or open a new one about OTA.

@mribble
Copy link
Contributor

mribble commented Jan 23, 2018

If you can show an example to artificially lower that timeout mechanism on the esp8266 that should work for my old use case. I think such an example would be useful and it's still not clear to me how to do that with lwip v2. That said in my app I have started reusing http connections in my javascript so I don't actually have the memory issue anymore even with the default configuration. I think explaining what could be changed to fix this like I did in the first response of this thread would help a lot of people (though some still would need the example from you on how to work around it in the esp8266 code).

For #1, I made a small test app and it didn't reproduce the issue. I narrowed down what what happening in my bigger app. For some reason WiFi.status() is returning WL_DISCONNECTED even when it is connected and working. I'm not sure why this is the case, but I am doing some strange things in my full app. For now I'll assume it's a bug in my app and will look into it in more detail. If I find a real issue in your code I'll file a seperate bug as you suggested.

There are a few fixes in master for OTA. I have verified recent builds work so that seems to be fixed since the 2.4.0 version as you thought it was. Once there is an ESP8266 2.5.0 release with the OTA fixes I will switch my project to using lwip v2, but until then I'll stick with lwip v1.4 to make it easier for others to contribute without a lot of extra install instructions.

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 24, 2018

As promised here is the snippet.

@mribble
Copy link
Contributor

mribble commented Jan 24, 2018

Thanks you! That made it look too easy. I'm sure it will be helpful to a lot of people. 👍

@d-a-v d-a-v removed this from the 2.4.1 milestone Jan 25, 2018
@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 25, 2018

@mribble does lwip2+tcpCleanup fix your problem ?

@mribble
Copy link
Contributor

mribble commented Jan 25, 2018

As I mentioned before I have changed my javascript code to work around the issue so I no longer need this code (and in fact don't want to call it because it would hurt performance).

That said I reverted to an old version of my javascript and confirmed with both lwip 1.4 and 2 this code fixes the problem with too many new xlmHttpRequests.

I can't switch to lwip v2 due to the issues with OTA updates in the esp8266 2.4.0 release, but that seems fixed in master and once there is a 2.5.0 release with those fixes I'll switch to using lwip v2.

Thanks for the great support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants