MSC3922: Removing SRV records from homeserver discovery #3922

Open

Member

@turt2live turt2live commented Nov 1, 2022

Rendered

Note: This is not eligible for FCP until enough time has passed to allow legitimate uses of SRV records to be identified. This is considered the implementation requirement for this MSC.

@turt2live changed the title from "Removing SRV records from homeserver discovery" to "MSC3922: Removing SRV records from homeserver discovery" on Nov 1, 2022
@turt2live added the proposal, s2s, kind:maintenance, and needs-implementation labels on Nov 1, 2022
@turt2live marked this pull request as ready for review on November 1, 2022 10:06
Contributor

@neilalexander neilalexander left a comment

(moved inline)

4. Server discovery fails and the server is presumed offline or invalid if it has not been resolved to
a usable IP and port by this step.

Clearly this would cause disruption in the larger ecosystem as some servers might still be using SRV
Contributor

On the whole, I worry that this will create a significant headache if the Matrix federation protocol changes enough to no longer be HTTP-centric (which I really hope it will).

I have been advocating for some research into a binary federation protocol for ages now to get around the computational expense, wire bloat and signing difficulties that JSON has. This puts quite a roadblock in the way of that as it still mandates that the discovery part of the stack is HTTP+TLS even if no other part of the stack is.

DNS SRV might be awkward to configure but it's otherwise protocol-agnostic. I don't think it is wise at all to remove it.

Member Author

This MSC is part 1 of a fairly long series tied in with the IETF/MIMI work we're doing. Specifically, we're aiming to separate transport from the protocol by defining a "Matrix over HTTP+JSON" thing, which would naturally include discovery. The discovery mechanism itself might still be HTTP-centric, but that does not require the entire federation protocol to be over HTTP.

Member

It feels to me like a different transport mechanism would necessitate a different discovery mechanism. SRV records don't provide a way for a server to say "actually you can talk to me over CoAP instead of HTTPS", so I don't really understand what benefit they provide in the current ecosystem.

On the other hand, one of the reasons SRV records currently suck is that they interact poorly with HTTPS. (Or at least, they don't interact in the intuitive way.) Everyone has expectations about how HTTPS and DNS interact, and SRV records don't follow the pattern.

With a different transport mechanism, perhaps those expectations will be different, in which case we can consider using SRV records for that transport. But I don't see why that means we need to keep SRV records for HTTP-based transports.
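
To make the mismatch in expectations concrete, here is a minimal Python sketch of which name ends up being checked against the certificate under each discovery path, as I read the current server-to-server spec. `ConnectionPlan` and `plan_connection` are hypothetical names used only for illustration, and the sketch ignores IP literals and explicit ports.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ConnectionPlan:
    connect_host: str  # name whose A/AAAA records get resolved
    connect_port: int
    tls_name: str      # name used for the Host header and certificate check


def plan_connection(
    server_name: str,
    well_known_host: Optional[str] = None,
    srv: Optional[Tuple[str, int]] = None,
) -> ConnectionPlan:
    """Pick the connect target and TLS name for a port-less server name."""
    # .well-known delegation changes the name the certificate must cover.
    tls_name = well_known_host or server_name
    if srv is not None:
        # An SRV record only changes where we connect. The certificate must
        # still match the name that was looked up, not the SRV target.
        target, port = srv
        return ConnectionPlan(target.rstrip("."), port, tls_name)
    # No SRV record: connect to the name itself on the default federation port.
    return ConnectionPlan(tls_name, 8448, tls_name)
```

With an SRV record pointing `example.org` at `host.example.net`, the connection goes to `host.example.net` but the certificate still has to be valid for `example.org`, which is the part that tends to surprise admins used to ordinary HTTPS hosting.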

I would like to point out that SRV is currently problematic for e.g. Dendrite and all other implementations written in Go. Go does not try to handle domain name compression gracefully when resolvers and/or authoritative name servers act against the RFC with TXT and SRV records. I've run into this issue with multiple homeservers in the past when debugging obscure cases of broken federation. Some of the relevant details of the behaviour can be found in golang/go#10622.

Additionally, I've seen cases where SRV works "by accident" because an admin didn't request separate certificates as they had originally planned to, so the required hostname happened to be included correctly even though their intention was to use separate certificates. The point being, SRV is hard to get right because it doesn't follow the conventions familiar from HTTPS.

An additional or differently formatted SRV record would be required for anything other than the current HTTPS-based federation service discovery anyway, so designing and adding such a method back later is IMO a rather obvious possibility, not a blocker for this proposal. Getting rid of SRV for HTTPS-based federation would avoid many problems and much wasted time, so the relevant consideration is whether all practical use cases can be met with well-known based discovery alone.

I would suggest adding a configuration option to log homeservers that were discovered over SRV, so that participating homeserver admins can easily gather real-life usage information. The log would also help detect cases where Go's DNS resolution fails and an admin is debugging federation problems encountered by e.g. Dendrite.

Unfortunately, the mishandling of domain name compression can happen at either end, so this cannot be fully solved by adding test functionality to federationtester alone. A properly working caching resolver can hide the issue, or a badly behaving one could introduce it regardless of a correct response from the authoritative name server. I don't have statistics or further information on the name servers involved in the cases I have had to debug. Go has decided not to mitigate the issue, unlike practically every other implementation seems to have done, so yes, it's (still) the DNS, again.
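
For readers unfamiliar with domain name compression, the following is a minimal Python sketch of how a compressed name in a DNS message is decoded (RFC 1035, section 4.1.4). It is not taken from any resolver; `read_name` and the hop limit are illustrative assumptions, and the sketch only shows what a pointer is, not how any particular library reacts to one.

```python
from typing import List, Tuple

MAX_POINTER_HOPS = 16  # hypothetical guard against pointer loops


def read_name(msg: bytes, offset: int) -> Tuple[str, int]:
    """Decode the name starting at `offset`; return (name, offset after the name)."""
    labels: List[str] = []
    end = None  # where parsing resumes in the record that contained the name
    hops = 0
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:
            # Top two bits set: a 14-bit pointer to an earlier name in the message.
            if end is None:
                end = offset + 2
            offset = ((length & 0x3F) << 8) | msg[offset + 1]
            hops += 1
            if hops > MAX_POINTER_HOPS:
                raise ValueError("too many compression pointers")
            continue
        offset += 1
        if length == 0:
            break  # the root label terminates the name
        labels.append(msg[offset:offset + length].decode("ascii"))
        offset += length
    if end is None:
        end = offset
    return ".".join(labels) + ".", end
```

A name such as `dove.example.com.` can therefore be stored as the single label `dove` followed by a two-byte pointer back to an earlier `example.com.`, which is the form a decoder has to be prepared to follow.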

Co-authored-by: Catalan Lover <48515417+FSG-Cat@users.noreply.github.com>
As identified, this change could impact legitimate usage of SRV records for discovery - this proposal
exists to give readers, homeserver authors, etc time to identify these cases before the proposal is put
up for final comment period (thus proposing it for acceptance). If legitimate cases of SRV records are
found, this proposal may be declined or rejected (per normal process).
Member

There are a couple of nice things about SRV that we don't get (at least currently) from the .well-known method.

The first, and one that has always made me fret a bit about .well-known, is that you're relying on your front door web server to be available. DNS has quite a lot of infrastructure (and institutional knowledge) to make it available in the event of outages. A lot of work has been done to make Synapse behave OK in the event of .well-known failures by using caching, but it is convoluted to get right (cf. the sheer number of constants involved). Whereas for DNS we get a lot of this for free (via caching resolvers).

Broadly though: it makes me a bit nervous to tie availability of federation to the availability of your front door web server (which is a natural target for e.g. DDoS and the like). Especially for smaller deployments, where it generally doesn't take much for their website to go down.
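
As a rough illustration of the caching described above, here is a minimal Python sketch of a .well-known cache that keeps serving a stale answer when the front door web server is unreachable. It is not Synapse's implementation: `WellKnownCache`, the injected `fetch` callable, and both time limits are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class WellKnownCache:
    """Caches delegation results and serves stale entries when a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        ttl: float = 3600.0,         # hypothetical freshness period, in seconds
        max_stale: float = 86400.0,  # hypothetical limit for serving stale data
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._max_stale = max_stale
        self._clock = clock
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def lookup(self, server_name: str) -> Optional[str]:
        now = self._clock()
        cached = self._entries.get(server_name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh hit, no network traffic
        try:
            result = self._fetch(server_name)
        except OSError:
            # Refresh failed: fall back to the old answer if it is not too old.
            if cached is not None and now - cached[1] < self._max_stale:
                return cached[0]
            return None
        self._entries[server_name] = (result, now)
        return result
```

Even this toy version already needs two tunable limits, which hints at why the real thing ends up with many constants, whereas a caching DNS resolver provides comparable behaviour without application code.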

Secondly, SRV records have inbuilt support for returning multiple servers with different priorities and weights. This is not currently used by anyone (AFAIK), but may prove to be very helpful in the event we get an HA Matrix server. We see these options used heavily in SMTP land (via MX records), where you have your primary set of SMTP servers and then your backup set of servers in cases of severe outages of your primary set. This can be added to .well-known, but again you run the risk of re-implementing SRV records.
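
To show what priorities and weights buy you, here is a minimal Python sketch of ordering SRV targets: lower priority values are tried first, and within one priority the order is a weighted random draw. It is a simplification of the selection algorithm in RFC 2782, and `SrvRecord` and `order_srv` are hypothetical names.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records: List[SrvRecord], rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client should attempt them."""
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            # If every weight is zero, fall back to a uniform draw.
            pick = rng.choices(group, weights=weights if any(weights) else None)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered
```

A backup server published at a higher priority value is only attempted after every primary has been tried, which is the MX-style failover described above.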

I do have sympathy for the argument that we shouldn't overly worry about this as it's just not used currently, but personally I think we need to keep half an eye on ensuring that it would be possible to implement HA in federation sanely.

Most of the above arguments were made when we introduced .well-known, and we generally considered that delegation was more important than those features. I do think it's worth explicitly calling out the above as it means that server admins would no longer be able to take advantage of those features if they chose to.

@Saklad5 Saklad5 left a comment

I am using SRV records myself, mainly because HTTPS records are still a draft spec and not widely implemented.

I strongly encourage replacing this with HTTPS records before removing them.


Here's my current setup, which relies on HTTPS records for client-server interaction (my client supports them) and SRV records for server-server interaction.

I have fallback A/AAAA records for <saklad5.com>, but for this example I'm assuming/pretending they are not being relied on. Those aside, I also rely on HTTPS records for the well-known URIs.


Resolution

;; QUESTION SECTION:
;saklad5.com.			IN	HTTPS

;; ANSWER SECTION:
saklad5.com.		172800	IN	HTTPS	1 xuahkwjssci42ywuenj5zvn5jdm4o5zcgrqqhbs25sd75dhmz6yyvmqd.onion. alpn="h2"
saklad5.com.		172800	IN	HTTPS	2 auk.saklad5.com. alpn="h3,h2"

;; ADDITIONAL SECTION:
auk.saklad5.com.	172800	IN	A	159.223.125.173
auk.saklad5.com.	172800	IN	AAAA	2604:a880:400:d0::2204:1
auk.saklad5.com.	172800	IN	HTTPS	0 .

Client

https://saklad5.com/.well-known/matrix/client:

{"m.homeserver":{"base_url":"https://matrix.saklad5.com"}}

;; QUESTION SECTION:
;matrix.saklad5.com.		IN	HTTPS

;; ANSWER SECTION:
matrix.saklad5.com.	86400	IN	HTTPS	1 45ebt6an43rs6btnrgwjlgrpohjgmvjftwvojzbaz542e2ml3cmrmzqd.onion. alpn="h2"
matrix.saklad5.com.	86400	IN	HTTPS	2 dove.saklad5.com. alpn="h2"

;; ADDITIONAL SECTION:
dove.saklad5.com.	172800	IN	A	45.33.37.128
dove.saklad5.com.	172800	IN	AAAA	2600:3c01::f03c:93ff:fe6a:83b3
dove.saklad5.com.	172800	IN	HTTPS	0 .

Server

https://saklad5.com/.well-known/matrix/server:

{"m.server":"matrix.saklad5.com"}

;; QUESTION SECTION:
;_matrix._tcp.matrix.saklad5.com. IN	SRV

;; ANSWER SECTION:
_matrix._tcp.matrix.saklad5.com. 86400 IN SRV	0 0 443 45ebt6an43rs6btnrgwjlgrpohjgmvjftwvojzbaz542e2ml3cmrmzqd.onion.
_matrix._tcp.matrix.saklad5.com. 86400 IN SRV	1 0 443 dove.saklad5.com.

;; ADDITIONAL SECTION:
dove.saklad5.com.	172800	IN	A	45.33.37.128
dove.saklad5.com.	172800	IN	AAAA	2600:3c01::f03c:93ff:fe6a:83b3
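
Read the way a federating server would, the well-known body above delegates to a hostname without a port, so (as I understand the current spec) the SRV answer then decides the target and port. A minimal Python sketch of parsing both pieces follows; `parse_well_known` and `parse_srv_line` are hypothetical helpers, and the SRV parser assumes dig-style answer lines like the ones shown.

```python
import json
from typing import Optional, Tuple


def parse_well_known(body: str) -> Tuple[str, Optional[int]]:
    """Return (host, port) from a /.well-known/matrix/server body; port may be None."""
    delegated = json.loads(body)["m.server"]
    host, sep, port = delegated.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    # No explicit port (this also covers a bracketed IPv6 literal without one).
    return delegated, None


def parse_srv_line(line: str) -> Tuple[int, int, int, str]:
    """Parse one dig-style SRV answer line into (priority, weight, port, target)."""
    fields = line.split()
    # Expected fields: name, TTL, class, type, priority, weight, port, target
    if len(fields) != 8 or fields[3] != "SRV":
        raise ValueError("not an SRV answer line: %r" % line)
    priority, weight, port = (int(f) for f in fields[4:7])
    return priority, weight, port, fields[7]
```

Applied to the records above, the lowest priority value (0) belongs to the onion target, with `dove.saklad5.com.` at priority 1 as the fallback.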

I do encourage replacing SRV records (and well-known URIs [1]) with SVCB/HTTPS records: among other benefits, they aren't tied to TCP specifically. This is not a small issue, as servers with both HTTP/2 and HTTP/3 will be listening on UDP as well.

I also encourage including DANE support in the spec, since not everyone can use DNS-01 ACME challenges like I'm doing for <matrix.saklad5.com>. An ACME challenge that acknowledges SVCB/HTTPS records will undoubtedly be developed, but that's hardly the only benefit DANE has for a setup like this.

The equivalent of DANE for onion services, by the way, is quite simple: they're inherently trustworthy [2] by virtue of including their key in their address, so a DNSSEC-secured record pointing to one is just as secure as if it pointed to a clearnet address with a TLSA record.

Footnotes

  1. I originally tried setting up my homeserver without any well-known URIs at all, giving it a certificate for <saklad5.com>. This quickly ran into an issue: manually setting up Matrix clients resulted in them demanding a certificate for <dove.saklad5.com>, and I'm trying to maintain strict separation between URIs and URLs here.

    Since Matrix doesn't define support for specialized HTTPS records (yet), that meant that I either had to move my homeserver to <auk.saklad5.com> (unacceptable), add an explicit port to my Matrix address and use the resulting distinct HTTPS records (also unacceptable), or use well-known URIs. Since I was doing it anyway, I figured I may as well point other servers there and save myself the trouble of keeping certificates for both <saklad5.com> and <matrix.saklad5.com> on the server.

  2. In the domain-validated sense, I mean. Just as a domain-validated/TLSA certificate proves that you are connecting to the owner of a clearnet address, an onion address proves you are connecting to the owner of the private key associated with that address.

@@ -0,0 +1,84 @@
# MSC3922: Removing SRV records from homeserver discovery
Member Author

For an abundance of clarity: this MSC is currently extremely low on the priority list, and is leaning towards rejection rather than acceptance.

This MSC just came up in #conduit:fachschaften.org, and as an SDK & power user client developer, I would absolutely love to see SRV records removed, as I have functionality that e.g. explicitly depends on /_matrix/federation/v1/version. Getting this in a browser is currently a big pain point.

Adding that endpoint to the Client-Server API would help, but that doesn't solve e.g. being able to look up server keys or other future extensions to the federation APIs.
