Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The relay send path of an Endpoint is too easily congested. #3008

Closed
PaulOlteanu opened this issue Dec 4, 2024 · 2 comments · Fixed by #3062
Closed

The relay send path of an Endpoint is too easily congested. #3008

PaulOlteanu opened this issue Dec 4, 2024 · 2 comments · Fixed by #3062
Assignees
Labels
bug Something isn't working c-iroh c-iroh-relay perf performance related issues

Comments

@PaulOlteanu
Copy link
Contributor

PaulOlteanu commented Dec 4, 2024

I'm occasionally seeing Endpoints get into a state where they're unable to connect to any other nodes, and the logs start showing this:

2024-12-04 04:31:18 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:18.002 -0500 EST debug "relay channel full, dropping call-me-maybe"
2024-12-04 04:31:19.926 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:19.928 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:19.929 -0500 EST debug "relay channel full, dropping call-me-maybe"
2024-12-04 04:31:19.931 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:20.926 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:20.929 -0500 EST debug "no UDP or relay paths available for node"
2024-12-04 04:31:20.93 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:20.932 -0500 EST debug "no UDP or relay paths available for node"
2024-12-04 04:31:22.924 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:22.925 -0500 EST debug "no UDP or relay paths available for node"
2024-12-04 04:31:22.926 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:22.927 -0500 EST debug "no UDP or relay paths available for node"
2024-12-04 04:31:22.999 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:23.001 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:23.003 -0500 EST debug "relay channel full, dropping call-me-maybe"
2024-12-04 04:31:23.004 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:23.006 -0500 EST debug "send relay: message dropped, channel to actor is full"
2024-12-04 04:31:23.007 -0500 EST debug "relay channel full, dropping call-me-maybe"

This continues for a few minutes and never seems to recover until the Endpoint is dropped and a new one is created, at which point things work fine again.

In this particular case the client was in India, connecting to a relay in the US (because we're only running one relay). Not sure if latency to the relay could affect this. I have seen this happen with a client from Canada so I don't think it would be exclusively due to high latency

@flub
Copy link
Contributor

flub commented Dec 5, 2024

Thanks for filing this. It is a kind of a dup to #2951 (I appreciate there's no way to tell this!) but I hadn't split the sender-side relay path into it's own issue yet like the others so will use this one to track it.

@flub flub self-assigned this Dec 5, 2024
@flub flub added bug Something isn't working perf performance related issues c-iroh-relay c-iroh labels Dec 5, 2024
@flub flub changed the title Endpoint seems to be unable to reach relay The relay send path of an Endpoint is too easily congested. Dec 5, 2024
@flub
Copy link
Contributor

flub commented Dec 5, 2024

#2971 is also very likely a dup of this. Though that also talks about the relay server not finding clients, which is likely either a red herring or a different issue.

github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
…MagicSock (#3062)

## Description

This refactors how datagrams flow from the MagicSock (AsyncUdpSocket) to
relay server and back. It also vastly simplifies the actors involved in
communicating with a relay server.

- The `RelayActor` managed all connections to relay servers.
  - It starts a new `ActiveRelayActor` for each relay server needed.
  - The `ActiveRelayActor` will exit when unused.
    - Unless it is for the home relay, this one never exits.
  - Each `ActiveRelayActor` uses a relay `Client`.
- The relay `Client` is now a `Stream` and `Sink` directly connected to
the `TcpStream` connected to the relay server. This eliminates several
actors previously used here in the `Client` and `Conn`.
- Each `ActiveRelayActor` will try and maintain a connection with the
relay server.
- If connections fail, exponential backoff is used for reconnections.
- When `AsyncUdpSocket` needs to send datagrams:
  - It (now) puts them on a queue to the `RelayActor`.
- The `RelayActor` ensures the correct `ActiveRelayActor` is running and
forwards datagrams to it.
  - The `ActiveRelayActor` sends datagrams directly to the relay server.
- The relay receive path is now:
- Whenever `ActiveRelayActor` is connected it reads from the underlying
`TcpStream`.
- Received datagrams are placed on an mpsc channel that now bypasses the
`RelayActor` and goes straight to the `AsyncUpdSocket` interface.

Along the way many bugs are fixed.  Some of them:

- The relay datagrams send and receive queue now behave more correctly
when they are full. So the `AsyncUdpSocket` behaves better.
- Though there still is a bug with the send queue not waking up all the
tasks that might be waiting to send. This needs a followup: #3067.
- The `RelayActor` now avoids blocking. This means it can still react to
events when the datagrams queues are full and reconnect relay servers
etc as needed to unblock.
- The `ActiveRelayActor` also avoids blocking. Allowing it to react to
connection breakage and the need to reconnect at any time.
- The `ActiveRelayActor` now correctly handles connection errors and
retries with backoff.
- The `ActiveRleayActor` will no longer queue unsent datagrams forever,
but flush them every 400ms.
- This also stops the send queue into the `RelayActor` from completely
blocking.


## Breaking Changes

### iroh-relay

- `Conn` is no longer public.
- The `Client` is completely changed.  See module docs.

## Notes & open questions

- Potentially the relay `Conn` and `Client` don't need to be two
separate things now? Though Client is convenient as it only implements
one Sink interface, while Conn is also a Frame sink. This means on Conn
you often have to use very annoying syntax when calling things like
.flush() or .close() etc.
- Maybe a few items from the `ActiveRelayActor` can be moved back into
the relay `Client`, though that would probably require some gymnastics.
The current structure of `ActiveRelayActor` is fairly reasonable and
handles things correctly. Though it does have a lot of client knowledge
baked in. Being able to reason about the client as a stream + sink is
what enabled me to write the good `ActiveRelayActor` though, so I'm
fairly happy that this code makes sense as it is.

If all goes well this should:
Closes #3008 
Closes #2971 
Closes #2951

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.

---------

Co-authored-by: Friedel Ziegelmayer <me@dignifiedquire.com>
@flub flub closed this as completed in #3062 Jan 3, 2025
@github-project-automation github-project-automation bot moved this to ✅ Done in iroh Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c-iroh c-iroh-relay perf performance related issues
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants