Async TcpStream cannot send larger Amounts of Data #472

Closed
fko-kuptec opened this issue Aug 21, 2024 · 14 comments

@fko-kuptec

fko-kuptec commented Aug 21, 2024

Description

I am honestly not sure whether this is the right place for this issue. In general, I am currently trying to find any working solution for async networking on my ESP32-S3. More specifically, I want to get an async HTTP server working.

I am aware of the tcp_async example that is using async-io for TCP communication. However, I got a lot of different panics when trying to use it. At least in combination with a Wi-Fi AP, I was not able to get async-io working for me at all. Even without networking, just trying to use the Timer did not work. Increasing the default pthread stack size helped avoid stack overflows, but I still got other panics, IIRC.

I really wanted to avoid tokio, but when I finally tried it, it actually worked! The timer works, even TCP works. However, when I tested serving actual files over HTTP, I found that only sending very small files is possible. As soon as larger chunks of data are written to the TcpStream, the write_all call never returns. The same issue shows up when limiting the call to write_all with a timeout and then using writable to wait for the stream to become ready: the call to writable never returns. On the client side, when requesting the file with PowerShell's Invoke-WebRequest tool, the download repeatedly got stuck at a specific number of bytes.
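For illustration, the failing pattern boils down to something like this (a minimal sketch with made-up names and sizes, not the exact code from the example repo linked below):

```rust
use std::time::Duration;

use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Hypothetical helper mirroring the behavior described above.
async fn send_body(stream: &mut TcpStream, body: &[u8]) -> std::io::Result<()> {
    // For small bodies this completes; for larger bodies it never returns on the ESP32-S3.
    let result = tokio::time::timeout(Duration::from_secs(5), stream.write_all(body)).await;
    match result {
        Ok(write_result) => write_result,
        Err(_timed_out) => {
            // After the timeout, waiting for the stream to become writable never returns either.
            stream.writable().await?;
            Ok(())
        }
    }
}
```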

Speculation

My guess is that when sending larger amounts of data, some buffer fills up, and the implementation fails to deliver an event when the buffer has room again. But I am just not deep enough into the code to find a solution or even to prove that this assumption is correct. Maybe someone else can help here?

Example

I created an example repo which shows this issue.

@fko-kuptec
Author

Well, after some testing I would contradict my own theory of a missing event when the send buffer is free again. I've reimplemented the given example with the polling crate and a plain std::net::TcpStream. The changes are visible here. This way I was able to implement event-based TCP I/O and could send larger files without problems. Therefore I would assume that there is a bug somewhere in tokio or mio, at least at this point.
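A simplified sketch of that pattern (made-up names, polling 3-style API; the stream is assumed to be non-blocking and already added to the poller, which is not shown here):

```rust
use std::io::{self, Write};
use std::net::TcpStream;

use polling::{Event, Events, Poller};

// Write the whole buffer, waiting for writability via the poller between attempts.
fn write_all_evented(poller: &Poller, stream: &TcpStream, key: usize, mut buf: &[u8]) -> io::Result<()> {
    let mut events = Events::new();
    while !buf.is_empty() {
        // Registrations are one-shot: re-arm the writable interest before each wait.
        poller.modify(stream, Event::writable(key))?;
        events.clear();
        poller.wait(&mut events, None)?;
        // Write is implemented for &TcpStream, so a shared reference suffices here.
        match (&*stream).write(buf) {
            Ok(n) => buf = &buf[n..],
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```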

@ivmarkov
Collaborator

Description

I am honestly not sure whether this is the right place for this issue. In general, I am currently trying to find any working solution for async networking on my ESP32-S3. More specifically, I want to get an async HTTP server working.

Thank you for providing such detailed feedback!

I am aware of the tcp_async example that is using async-io for TCP communication. However, I got a lot of different panics when trying to use it. At least in combination with a Wi-Fi AP, I was not able to get async-io working for me at all. Even without networking, just trying to use the Timer did not work. Increasing the default pthread stack size helped avoid stack overflows, but I still got other panics, IIRC.

There are two reasons for the panics:

  • Reason 1: Stack overflows, as you correctly pointed out. The stock async-io is especially prone to stack overflows for two reasons:
    • It runs a hidden thread named async-io. This thread runs with the pthread stack size you have configured in sdkconfig.defaults. If you have not configured anything, it is really low (~3K), so at a minimum you might want to double it (or even triple it) with this setting.
    • The Reactor of async-io is created lazily, using a lazy_static construct (or similar; I don't remember the details). The thing is, I've noticed that even though it is created statically, since it is created lazily upon first hit, the memory for it is first allocated on the stack and only afterwards moved to the static location (this is the usual trouble of Rust moves not being very optimized, if you know what I mean). What usually helps is to "provoke" the lazy creation of the Reactor by - say - instantiating a timer early on in your program, from a thread whose stack you control (see the sketch after this list). Otherwise, it might happen in the hidden async-io thread and, as I said, if you have not tripled its stack size, it might blow up.
  • Reason 2: The other reason is this bug, which I promised to follow up on several months ago but oh well. :( It triggers on the original ESP32 Xtensa chip, but I would not be surprised if it also triggers on the S3 variant you are using and is contributing to your troubles.
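For illustration, a minimal sketch of the "provoke the Reactor early" idea (this snippet is illustrative rather than taken from async-io, and it assumes the pthread stack size in sdkconfig.defaults has already been raised, e.g. via CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT - double-check the exact option name):

```rust
use std::time::Duration;

fn main() {
    // Touch a Timer once from the main thread (whose stack size we control),
    // so that async-io's lazily-created Reactor is initialized here rather
    // than later on some small-stack pthread.
    futures_lite::future::block_on(async {
        async_io::Timer::after(Duration::from_millis(10)).await;
    });

    // ... rest of the application ...
}
```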

OK, but what can you do?
I would say, stick with async-io, but until the above bug ^^^ is fixed, perhaps use this variation of mine, which does not exhibit a similar compiler bug and also avoids some of the issues with async-io from above (the lazy_static stuff; the requirement for large stack sizes).

I really wanted to avoid tokio, but when I finally tried it, it actually worked! The timer works, even TCP works. However, when I tested serving actual files over HTTP, I found that only sending very small files is possible. As soon as larger chunks of data are written to the TcpStream, the write_all call never returns. The same issue shows up when limiting the call to write_all with a timeout and then using writable to wait for the stream to become ready: the call to writable never returns. On the client side, when requesting the file with PowerShell's Invoke-WebRequest tool, the download repeatedly got stuck at a specific number of bytes.

The problem with tokio is that the original contributor of the code that extended mio so that it can use the poll syscall (which is what is used with ESP IDF) seems inactive lately. As much as I would like to help there, I simply don't have the time, and since I prefer lighter-weight stuff, I tend to stick with async-io or my variation of it for now and don't have the impulse to hunt down what is wrong there.

As for the polling crate working correctly for you - I'm not surprised. Putting aside the Xtensa compiler bug and the annoyance of having to deal with stack overflows, async-io (and the polling crate underneath) does work.

@ivmarkov
Collaborator

By the way - and please don't take it as an advertisement of my own crates; I have other work to do anyway :) - you might want to give the HTTP server of edge-net a whirl. It is syntax sugar over the splendid httparse crate, plus some additions of mine around chunked-encoding support and a few other smaller things. And it supports async-io(-mini) out of the box.

@ivmarkov
Collaborator

Oh - and please use async-io-mini and edge-net directly from Git for now. I can put updated versions of those on crates.io if there is enough interest (edge-net does have versions there, but they are old by now).

@fko-kuptec
Author

First of all, thank you very much for being such a blessing. 😊

In the meantime, I got my own example working by using mio directly, as long as I re-registered for the wanted event after every read or write operation (basically as with polling). I am, however, not sure whether mio is supposed to work like that. So the issue is either a working-but-insufficient mio implementation, or the ESP backend triggering some edge case in tokio that no other OS does. The question is what to do with that information so that it at least does not get lost... 🤷‍♂️
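For concreteness, the workaround looks roughly like this (a simplified sketch with made-up names, not the actual code from my example; the stream is assumed to have been registered with the registry once already):

```rust
use std::io::{self, Write};

use mio::net::TcpStream;
use mio::{Events, Interest, Poll, Token};

const CLIENT: Token = Token(0);

// Write the whole buffer, re-registering the writable interest before every wait.
fn write_all_reregistering(poll: &mut Poll, stream: &mut TcpStream, mut buf: &[u8]) -> io::Result<()> {
    let mut events = Events::with_capacity(8);
    while !buf.is_empty() {
        // Workaround: re-arm the interest after every operation, as if the
        // backend behaved like one-shot registrations.
        poll.registry().reregister(stream, CLIENT, Interest::WRITABLE)?;
        poll.poll(&mut events, None)?;
        match stream.write(buf) {
            Ok(n) => buf = &buf[n..],
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```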

Thank you for the explanation of the issues with async-io and for the pointers to async-io-mini and edge-net. I will probably give them a try. Up until now I wanted to use picoserve, but that is tailored more towards embassy and tokio, even though I should be able to get it working with async-io-mini as well. We'll see. ^^

@fko-kuptec
Author

This is so nice! edge-net even has DHCP server support 😁

@ivmarkov
Collaborator

In the meantime, I got my own example working by using mio directly, as long as I re-registered for the wanted event after every read or write operation (basically as with polling). I am, however, not sure whether mio is supposed to work like that. So the issue is either a working-but-insufficient mio implementation, or the ESP backend triggering some edge case in tokio that no other OS does.

If I have to make a guess... it is the former (an insufficient mio implementation OR the crate that sits on top of it) rather than the latter. Recollecting some memories... aside from my personal bias towards async-io, the other reason I concentrated there on upstreaming ESP IDF support is that the polling crate is agnostic w.r.t. the notion of level-triggered (poll, select syscalls) vs edge-triggered (epoll, kqueue syscalls) readiness, in that it kind of supports either, and the Reactor on top has to deal with that. mio, on the other hand, back in the day supported only "edge"-style triggering, so to me it looked like a lot of lift-and-shift to implement support for the poll or select syscall (which are level-based) in mio. The person who contributed the poll / ESP-IDF support to mio did exactly that, I think (somehow supporting "level"-style triggering), but there is maybe still a bug or two lurking in there, precisely because mio was all about "edge" triggering and "level" triggering was not originally supported.

The question is what to do with that information so that it at least does not get lost... 🤷‍♂️

Open a bug in the mio crate? Might be very useful, especially if you can reproduce on regular Linux (as it supports the poll syscall as well).

Thank you for the explanation of the issues with async-io and for the pointers to async-io-mini and edge-net. I will probably give them a try. Up until now I wanted to use picoserve, but that is tailored more towards embassy and tokio, even though I should be able to get it working with async-io-mini as well. We'll see. ^^

picoserve might be a great option too, I just don't have any experience with it, as it somehow appeared around the same time as edge-net's HTTP code. But I see it is quickly gaining traction.

@ivmarkov
Collaborator

Recollecting some memories... aside from my personal bias towards async-io, the other reason I concentrated there on upstreaming ESP IDF support is that the polling crate is agnostic w.r.t. the notion of level-triggered (poll, select syscalls) vs edge-triggered (epoll, kqueue syscalls) readiness, in that it kind of supports either, and the Reactor on top has to deal with that. mio, on the other hand, back in the day supported only "edge"-style triggering, so to me it looked like a lot of lift-and-shift to implement support for the poll or select syscall (which are level-based) in mio. The person who contributed the poll / ESP-IDF support to mio did exactly that, I think (somehow supporting "level"-style triggering), but there is maybe still a bug or two lurking in there, precisely because mio was all about "edge" triggering and "level" triggering was not originally supported.

Looking at the mio code, the poll support in mio is a copy-paste of the poll support from the polling crate, so it should be correct. But then, whether the reactor on top behaves correctly with it... no idea...

@fko-kuptec
Author

If I have to make a guess... it is the former (an insufficient mio implementation OR the crate that sits on top of it) rather than the latter. Recollecting some memories... aside from my personal bias towards async-io, the other reason I concentrated there on upstreaming ESP IDF support is that the polling crate is agnostic w.r.t. the notion of level-triggered (poll, select syscalls) vs edge-triggered (epoll, kqueue syscalls) readiness, in that it kind of supports either, and the Reactor on top has to deal with that. mio, on the other hand, back in the day supported only "edge"-style triggering, so to me it looked like a lot of lift-and-shift to implement support for the poll or select syscall (which are level-based) in mio. The person who contributed the poll / ESP-IDF support to mio did exactly that, I think (somehow supporting "level"-style triggering), but there is maybe still a bug or two lurking in there, precisely because mio was all about "edge" triggering and "level" triggering was not originally supported.

Ah, I didn't know that there was a distinction between different event polling modes. Good to know, thank you :)

Open a bug in the mio crate? Might be very useful, especially if you can reproduce on regular Linux (as it supports the poll syscall as well).

I already tried to reproduce that issue on Linux yesterday, but without "luck". However, I did not explicitly enable the poll backend. I will give that a try and see.

Would you like to keep this issue open to signal to other people that tokio is currently not really working?

@fko-kuptec
Author

I already tried to reproduce that issue on Linux yesterday, but without "luck". However, I did not explicitly enable the poll backend. I will give that a try and see.

I cannot reproduce it on Linux, at least not with my test case.

@ivmarkov
Collaborator

I already tried to reproduce that issue on Linux yesterday, but without "luck". However, I did not explicitly enable the poll backend. I will give that a try and see.

You absolutely have to. It is a cfg, I think, that you have to pass to cargo/rustc specifically to instruct mio to use the poll syscall, even if better alternatives would otherwise be used by default (i.e. epoll on Linux).
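If I recall correctly, forcing mio's poll-based selector looks something like this (the cfg name is cited from memory, so double-check it against the mio documentation):

```toml
# .cargo/config.toml
[build]
rustflags = ["--cfg", "mio_unsupported_force_poll_poll"]
```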

@fko-kuptec
Author

For my tests today, I've already enabled it :)

@fko-kuptec
Author

I can confirm that async TCP is now working with async-io-mini 👍

@ivmarkov
Collaborator

ivmarkov commented Sep 6, 2024

Closing only because the original root cause is in mio. You might want to open an issue in their repo that refers to this one...

@ivmarkov ivmarkov closed this as completed Sep 6, 2024