Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(iroh-relay)!: implement authentication #3086

Merged
merged 6 commits into from
Jan 8, 2025
Merged

Conversation

dignifiedquire
Copy link
Contributor

@dignifiedquire dignifiedquire commented Jan 2, 2025

Description

Design RFC: https://www.notion.so/number-zero/Relay-Authentication-16f5df1306fb80ac9e31c1ccb04e026b?pvs=4

Breaking Changes

  • added: field access to iroh_relay::server::RelayConfig

Notes & open questions

  • This reuses the health frame in the relay protocol to indicate the authentcation issue. The frame was previously never sent in our code, but is now more explicitly interpreted to disconnect from the server.
  • Should the callback be async? If this does more complex things like access a DB we probably want this to be async. it is now async

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

Copy link

github-actions bot commented Jan 2, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3086/docs/iroh/

Last updated: 2025-01-08T19:43:22Z

Copy link

github-actions bot commented Jan 2, 2025

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: b63bd52

@dignifiedquire dignifiedquire changed the title feat(iroh-relay): implement authentication 1feat(iroh-relay): implement authentication Jan 2, 2025
@dignifiedquire dignifiedquire changed the title 1feat(iroh-relay): implement authentication feat(iroh-relay)!: implement authentication Jan 2, 2025
Copy link
Collaborator

@Arqu Arqu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like I missed the initial RFC.

Generally very happy with this structure. Would like to ask for some upgrades.

Currently there's no way to manipulate the allow/deny list without restarting. Can we have some sort of resolver mechanism like we had for certs that we can pass in so it can easily be extended? For now the dummy resolver can just use the same fixed construction with Vec<NodeId>.

FWIW Happy to follow up with another PR to add the above.

@dignifiedquire
Copy link
Contributor Author

Currently there's no way to manipulate the allow/deny list without restarting.

There is, just not in the main binary. The expectation is that this would need additonal changes to the binary anyway, so you would pass in a manual version of iroh_relay::server::AccessConfig::Restricted callback when doing that, when writing your own.

@matheus23
Copy link
Contributor

Should the callback be async? If this does more complex things like access a DB we probably want this to be async.

Yes I think one would really quickly want this.

/// Access is allowed.
Allow,
/// Access is denied.
Deny,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to store a Deny(String) here so we can answer with a helpful error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, if folks want to log this they already have a callback where they can do so.

@flub
Copy link
Contributor

flub commented Jan 3, 2025

Currently there's no way to manipulate the allow/deny list without restarting.

There is, just not in the main binary. The expectation is that this would need additonal changes to the binary anyway, so you would pass in a manual version of iroh_relay::server::AccessConfig::Restricted callback when doing that, when writing your own.

If you go to very fancy setups you'd push these kinds of things as config updates to live servers via an internal API without needing restarts.

I guess the simple "unixy" way to do this is re-read the config from disk on SIGHUP.

But yes, the current way the server does it by callback is flexible enough to cater for all that.

(haven't reviewed the PR yet really)

Copy link
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you run this with the ActiveRelayActor? I guess currently it would receive the health message an just trace!("Ignoring {msg:?}"); it. After which a read error would occur because the connection is closed and it would go to exponetial backoff trying to re-connect.

Our exponential backoff maxes out at 5s. So pretty soon it would be settling in a pattern of connecting every 5s forever.

I guess that's ok, but not great. I've always found our backoff strategy rather aggressive, but didn't want to change it without reason. Maybe now is a time to think this through a bit better.

I'm not really sure what to suggest though. Because you also want clients to start working again if the server had a bad config deployed and then reverted. Maybe just up the max timeout in the backoff to something a little longer for now?

iroh-relay/src/main.rs Outdated Show resolved Hide resolved
iroh-relay/src/main.rs Show resolved Hide resolved
@@ -170,6 +171,49 @@ struct Config {
metrics_bind_addr: Option<SocketAddr>,
/// The capacity of the key cache.
key_cache_capacity: Option<usize>,
/// Access control
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we point out that this is only for the relay functionality? STUN, QAD and even the non-upgraded http services will keep working AFAIK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I don't think STUN can really be restricted, other than IP blocks. I don't know if there is anything for QAD that could be done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory you can restrict QAD because you have to accept a normal QUIC connection with TLS and all, with ALPN even in our case, before any QAD addresses are sent.

But I didn't mean we have to add this now, only to clarify that this doesn't protect those endpoints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd still be nice to expand on these docs. I know they don't end up anywhere visible because they're private items. But they're the only source we have for user-facing docs of the config format, so I've been treating them as user-facing docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some docs

if node_id == a_key {
Access::Deny
} else {
Access::Allow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not testing this path. I guess it's not really needed from a coverage viewpoint. But it being written with this specific if condition rather than a blanket deny makes it stand out and seem like you only implemented half of the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dignifiedquire
Copy link
Contributor Author

I'm not really sure what to suggest though. Because you also want clients to start working again if the server had a bad config deployed and then reverted. Maybe just up the max timeout in the backoff to something a little longer for now?

Yeah, I honestly don't know how to solve this in a better fashion, though shouldn't we stop trying to connect, unless this is our home server?

@flub
Copy link
Contributor

flub commented Jan 7, 2025

I'm not really sure what to suggest though. Because you also want clients to start working again if the server had a bad config deployed and then reverted. Maybe just up the max timeout in the backoff to something a little longer for now?

Yeah, I honestly don't know how to solve this in a better fashion, though shouldn't we stop trying to connect, unless this is our home server?

We stop trying to connect when there has nothing been trying to send for 60s (unless for the home relay as you mention). Indeed a good point to call out. Because that means setting our ceiling much higher is also not useful.

@dignifiedquire dignifiedquire self-assigned this Jan 8, 2025
@dignifiedquire dignifiedquire added this to the v0.31.0 milestone Jan 8, 2025
@dignifiedquire
Copy link
Contributor Author

Should the callback be async? If this does more complex things like access a DB we probably want this to be async.

Yes I think one would really quickly want this.

done

@@ -170,6 +171,49 @@ struct Config {
metrics_bind_addr: Option<SocketAddr>,
/// The capacity of the key cache.
key_cache_capacity: Option<usize>,
/// Access control
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd still be nice to expand on these docs. I know they don't end up anywhere visible because they're private items. But they're the only source we have for user-facing docs of the config format, so I've been treating them as user-facing docs.

"
access.allowlist = [
\"{node_id}\",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of points out this is a rather whaky format. So I tried:

access = "everyone"
access.allowlist = [
   "{node_id}",
]

And well, this happily fails. Which is something at least I guess.

I'm a bit torn. But I guess this is fine to use TOML syntax like this. So if you like this I'm not objecting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is what I expected 😅

/// Access is allowed.
Allow,
/// Access is denied.
Deny,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, if folks want to log this they already have a callback where they can do so.

@dignifiedquire dignifiedquire added this pull request to the merge queue Jan 8, 2025
Merged via the queue into main with commit 2c42eff Jan 8, 2025
25 of 26 checks passed
@dignifiedquire dignifiedquire deleted the feat-relay-auth branch January 8, 2025 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

4 participants