Relay send path can overwrite wakers and might report `Poll::Pending` more than necessary #3067

matheus23 · 2024-12-20T10:42:28Z

2001616

This likely doesn't turn into a pressing problem in practice, but could become an issue eventually/could help improve performance.

A lot of the difficulty in fixing this is working around the API bottleneck that is the &self reference in the AsyncUdpSocket::try_send API.

The text was updated successfully, but these errors were encountered:

…MagicSock (#3062) ## Description This refactors how datagrams flow from the MagicSock (AsyncUdpSocket) to relay server and back. It also vastly simplifies the actors involved in communicating with a relay server. - The `RelayActor` managed all connections to relay servers. - It starts a new `ActiveRelayActor` for each relay server needed. - The `ActiveRelayActor` will exit when unused. - Unless it is for the home relay, this one never exits. - Each `ActiveRelayActor` uses a relay `Client`. - The relay `Client` is now a `Stream` and `Sink` directly connected to the `TcpStream` connected to the relay server. This eliminates several actors previously used here in the `Client` and `Conn`. - Each `ActiveRelayActor` will try and maintain a connection with the relay server. - If connections fail, exponential backoff is used for reconnections. - When `AsyncUdpSocket` needs to send datagrams: - It (now) puts them on a queue to the `RelayActor`. - The `RelayActor` ensures the correct `ActiveRelayActor` is running and forwards datagrams to it. - The `ActiveRelayActor` sends datagrams directly to the relay server. - The relay receive path is now: - Whenever `ActiveRelayActor` is connected it reads from the underlying `TcpStream`. - Received datagrams are placed on an mpsc channel that now bypasses the `RelayActor` and goes straight to the `AsyncUpdSocket` interface. Along the way many bugs are fixed. Some of them: - The relay datagrams send and receive queue now behave more correctly when they are full. So the `AsyncUdpSocket` behaves better. - Though there still is a bug with the send queue not waking up all the tasks that might be waiting to send. This needs a followup: #3067. - The `RelayActor` now avoids blocking. This means it can still react to events when the datagrams queues are full and reconnect relay servers etc as needed to unblock. - The `ActiveRelayActor` also avoids blocking. Allowing it to react to connection breakage and the need to reconnect at any time. - The `ActiveRelayActor` now correctly handles connection errors and retries with backoff. - The `ActiveRleayActor` will no longer queue unsent datagrams forever, but flush them every 400ms. - This also stops the send queue into the `RelayActor` from completely blocking. ## Breaking Changes ### iroh-relay - `Conn` is no longer public. - The `Client` is completely changed. See module docs. ## Notes & open questions - Potentially the relay `Conn` and `Client` don't need to be two separate things now? Though Client is convenient as it only implements one Sink interface, while Conn is also a Frame sink. This means on Conn you often have to use very annoying syntax when calling things like .flush() or .close() etc. - Maybe a few items from the `ActiveRelayActor` can be moved back into the relay `Client`, though that would probably require some gymnastics. The current structure of `ActiveRelayActor` is fairly reasonable and handles things correctly. Though it does have a lot of client knowledge baked in. Being able to reason about the client as a stream + sink is what enabled me to write the good `ActiveRelayActor` though, so I'm fairly happy that this code makes sense as it is. If all goes well this should: Closes #3008 Closes #2971 Closes #2951 ## Change checklist - [x] Self-review. - [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [x] Tests if relevant. - [x] All breaking changes documented. --------- Co-authored-by: Friedel Ziegelmayer <me@dignifiedquire.com>

## Description  Closes #3067 Since multiple `Connection`s can have multiple `AsyncUdpSocket` `IpPollers`, it's incorrect to assume that a single `AtomicWaker` can wake up these tasks correctly. Instead, if there's only one `AtomicWaker` for all of them, only the last registered waker will be woken when there's capacity again. I'm changing this such that `poll_writable` will actually add the current task's waker to a list, if it hasn't been added yet. And successfully `recv`ing an item will wake all tasks. A second issue was that *between* the call to `self.sender.capacity()` (it being 0) and `self.waker.register`, another thread might have successfully `recv`d an item *and* called `self.waker.wake()`. This means that the first thread could've registered a waker, while the capacity is actually non-zero. Not terrible (it's very likely that it'll be woken up when the capacity jumps to 2 this time), but also not great. The fix is to re-check the capacity after registering the waker. The downside is that we keep the waker and thus potentially get a spurious wakeup, but that's fine. ## Notes & open questions  This... "fixes" the concerns, but I'd actually like to write a loom test for the concerns eventually. It's just... a *lot* of work, and I'd rather prioritize other things, but also improve this part of the code at the same time. ## Change checklist - [x] Self-review. - [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [x] All breaking changes documented.

matheus23 added bug Something isn't working c-iroh labels Dec 20, 2024

github-project-automation bot added this to iroh Dec 20, 2024

matheus23 changed the title ~~Relay send path sets wakers incorrectly and might report Poll::Pending more than necessary~~ Relay send path can overwrite wakers and might report Poll::Pending more than necessary Dec 20, 2024

flub mentioned this issue Dec 20, 2024

fix(iroh, iroh-relay)!: Optimise the relay datagram path through the MagicSock #3062

Merged

4 tasks

matheus23 mentioned this issue Jan 10, 2025

fix: Try improving relay_datagram_send_channel() #3118

Merged

4 tasks

flub mentioned this issue Jan 10, 2025

fix(iroh): AsyncUdpSocket backpressure when connection type is *not* mixed #3117

Closed

2 tasks

matheus23 closed this as completed in #3118 Jan 10, 2025

github-project-automation bot moved this to ✅ Done in iroh Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relay send path can overwrite wakers and might report `Poll::Pending` more than necessary #3067

Relay send path can overwrite wakers and might report `Poll::Pending` more than necessary #3067

matheus23 commented Dec 20, 2024

Relay send path can overwrite wakers and might report Poll::Pending more than necessary #3067

Relay send path can overwrite wakers and might report Poll::Pending more than necessary #3067

Comments

matheus23 commented Dec 20, 2024

Relay send path can overwrite wakers and might report `Poll::Pending` more than necessary #3067

Relay send path can overwrite wakers and might report `Poll::Pending` more than necessary #3067