MSC3922: Removing SRV records from homeserver discovery #3922

84 changes: 84 additions & 0 deletions proposals/3922-remove-srv-discovery.md
@@ -0,0 +1,84 @@
# MSC3922: Removing SRV records from homeserver discovery
Member Author

For the avoidance of doubt: this MSC is currently extremely low on the priority list, and is leaning towards rejection rather than acceptance.

This MSC just came up in #conduit:fachschaften.org, and as an SDK & power user client developer, SRV discovery is something that I would absolutely love to see removed, as I have functionality that explicitly depends on e.g. /_matrix/federation/v1/version. Getting at this from a browser is currently a big pain point.

Adding that endpoint to the Client-Server API would help, but that doesn't solve e.g. being able to look up server keys or other future extensions to the federation APIs.


Currently when [resolving server names](https://spec.matrix.org/v1.4/server-server-api/#resolving-server-names),
homeservers (or any implementation trying to locate a server, such as integration managers or widgets
using [OpenID Connect validation](https://spec.matrix.org/v1.4/server-server-api/#openid)) must be
able to resolve SRV DNS records. Aside from this being difficult for widgets (for example),
SRV records typically cause deployment issues because they do not work "as expected" by server administrators.

In addition to SRV records not "properly" supporting CNAMEs, TLS certificates are difficult to configure
correctly and often lead to issues with the wrong certificate being presented. These sorts of issues
come up often enough that [Synapse's documentation](https://matrix-org.github.io/synapse/v1.70/delegate.html#srv-dns-record-delegation)
doesn't even explain how to use SRV records, instead referencing the specification itself and citing that
.well-known is often what administrators are looking for. The documentation additionally calls it
"SRV delegation", further indicating that the use of SRV records is complex (it's not true delegation,
unlike what is possible with .well-known).

This proposal removes all references to SRV records from the homeserver discovery specification, and
lays out a plan to handle the rollout of such an invasive change.

## Proposal

In short, the [current rules](https://spec.matrix.org/v1.4/server-server-api/#resolving-server-names)
which reference SRV records are deleted. This leads to the following discovery mechanism:

*Note*: Some details, such as caching and certificate presentation, are excluded. They are unchanged.

1. If the hostname is an IP literal, then that IP address should be used. If a port number is given then
it should be used; otherwise port 8448 is used. The `Host` header in the request is set to the server name
(which is the IP address), with the port number if explicitly given.
2. If the hostname is *not* an IP literal, but does have an explicit port, resolve the name using A or
AAAA records to an IP and use that with the explicit port. The `Host` header in the request is set to
the server name, with port number.
3. If the hostname is *not* an IP literal, a regular HTTPS request is made to the .well-known endpoint
on that domain. The hostname returned by this endpoint is called the "delegated hostname", and discovery
steps 1 & 2 above are repeated with it. Step 3 (this step) is not repeated, as that could cause infinite
loops or needless delays in discovery.
4. Server discovery fails and the server is presumed offline or invalid if it has not been resolved to
a usable IP and port by this step.
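
The four steps above could be sketched roughly as follows. This is only an illustration, not part of the proposal: `split_server_name`, `resolve_direct`, `discover` and the `fetch_well_known` callable are hypothetical names, and the `m.server` response key comes from the linked v1.4 specification rather than from the text above. The sketch follows the list literally, so a delegated hostname without an explicit port is treated as unresolved; whether a default port should apply there is not stated above. A/AAAA resolution (step 2) is left to the socket layer.

```python
import ipaddress
import json
from typing import Callable, Optional, Tuple

DEFAULT_PORT = 8448


def split_server_name(name: str) -> Tuple[str, Optional[int]]:
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port)."""
    if name.startswith("["):
        host, _, rest = name[1:].partition("]")
        return host, int(rest[1:]) if rest.startswith(":") else None
    host, sep, port = name.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return name, None


def is_ip_literal(host: str) -> bool:
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def resolve_direct(name: str) -> Optional[Tuple[str, int]]:
    """Steps 1 and 2: IP literals and hostnames with an explicit port."""
    host, port = split_server_name(name)
    if is_ip_literal(host):
        # Step 1: use the IP, defaulting to port 8448.
        return host, (port if port is not None else DEFAULT_PORT)
    if port is not None:
        # Step 2: the caller resolves `host` via A/AAAA records.
        return host, port
    return None


def discover(server_name: str,
             fetch_well_known: Callable[[str], Optional[str]]) -> Tuple[str, int]:
    """Return (host, port) for a server name, or raise LookupError (step 4).

    `fetch_well_known` is an assumed helper: it performs the HTTPS request
    for the given hostname and returns the response body, or None on failure.
    """
    target = resolve_direct(server_name)
    if target is not None:
        return target
    # Step 3: ask .well-known, then repeat steps 1 and 2 only (never step 3).
    body = fetch_well_known(server_name)
    if body is not None:
        try:
            delegated = json.loads(body).get("m.server")
        except (ValueError, AttributeError):
            delegated = None
        if isinstance(delegated, str):
            target = resolve_direct(delegated)
            if target is not None:
                return target
    # Step 4: nothing produced a usable host and port.
    raise LookupError(f"could not discover {server_name!r}")
```

For example, with a dictionary standing in for the network, `discover("example.org", {"example.org": '{"m.server": "matrix.example.org:443"}'}.get)` yields `("matrix.example.org", 443)`, while `discover("1.2.3.4", {}.get)` yields `("1.2.3.4", 8448)` without any HTTP request.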

Clearly this would cause disruption in the larger ecosystem as some servers might still be using SRV
Contributor

On the whole I worry that this will create a significant headache if the Matrix federation protocol changes enough to no longer be HTTP-centric (which I really hope it will).

I have been advocating for some research into a binary federation protocol for ages now to get around the computational expense, wire bloat and signing difficulties that JSON has. This puts quite a roadblock in the way of that as it still mandates that the discovery part of the stack is HTTP+TLS even if no other part of the stack is.

DNS SRV might be awkward to configure but it's otherwise protocol-agnostic. I don't think it is wise at all to remove it.

Member Author

This MSC is part 1 of a fairly long series tied in with the IETF/MIMI work we're doing. Specifically, we're aiming to separate transport from the protocol by defining a "Matrix over HTTP+JSON" thing, which would naturally include discovery. The discovery mechanism itself might still be HTTP-centric, however this does not require the entire federation protocol to be over HTTP.

Member

It feels to me like a different transport mechanism would necessitate a different discovery mechanism. SRV records don't provide a way for a server to say "actually you can talk to me over CoAP instead of HTTPS", so I don't really understand what benefit they provide in the current ecosystem.

On the other hand, one of the reasons SRV records currently suck is that they interact poorly with HTTPS. (Or at least, they don't interact in the intuitive way.) Everyone has expectations about how HTTPS and DNS interact, and SRV records don't follow the pattern.

With a different transport mechanism, perhaps those expectations will be different, in which case we can consider using SRV records for that transport. But I don't see why that means we need to keep SRV records for HTTP-based transports.

I would like to point out that SRV is currently problematic for e.g. Dendrite and all other implementations using golang. Go does not try to handle domain name compression gracefully, when resolvers and/or authoritative NSes act against the RFC with TXT and SRV records. I've run into this issue with multiple HSes in the past when debugging obscure cases of broken federation. Some of the relevant details of the behaviour can be found e.g. here: golang/go#10622

Additionally I've seen cases where SRV works "by accident" because an admin didn't request separate certificates as they had originally planned to, so the required hostname happened to be included in the certificate even though their intention was to use separate ones. The point being, SRV is hard to get right as it doesn't follow the conventions familiar from HTTPS.

A new or differently formatted SRV record would be required for anything other than the current HTTPS-based federation service discovery anyway, so designing and adding such a method back later is IMO a rather obvious possibility, but not a blocker for this proposal. Getting rid of SRV for HTTPS-based federation would avoid many problems and much wasted time, so the relevant consideration is whether all practical use cases can be met with .well-known-based discovery alone.

I would suggest adding a configuration option to log HSes that were discovered over SRV, enabling participating HS admins to easily gather real-life usage information. Additionally, the log would help detect those cases where Go DNS resolution fails to serve its purpose and an admin is debugging federation problems encountered by e.g. Dendrite.

Unfortunately the mishandling of domain name compression can happen at either end, so this cannot be fully solved by adding test functionality to federationtester only. A properly working caching resolver can hide the issue, or a badly behaving one could introduce it regardless of a correct response from the authoritative NS. Unfortunately I don't have statistics or further information on the NSes that were involved in the cases I have had to debug. Golang has decided not to mitigate the issue, unlike practically every other implementation seems to have done, so yes, it's (still) the DNS, again.

records to identify themselves. Readers of this proposal are encouraged to proactively change over to
.well-known to identify if there are legitimate reasons for keeping SRV records, even if this proposal
is still in a draft/unapproved state.

In order to not cause massive breaking changes in the ecosystem, this proposal first deprecates SRV
discovery for a minimum of 1 calendar year from the time of the spec release itself. Afterwards, at the
discretion of the Spec Core Team (SCT), SRV discovery can be removed without notice.

Homeserver authors (Synapse, Dendrite, Conduit, etc) are encouraged to use the deprecation period to
help their users transition to .well-known discovery. Those reading this proposal before it is accepted
are also encouraged to help identify any legitimate reason to keep SRV discovery in the specification
(for example, a user of theirs is completely unable to switch to .well-known; the case would then be
discussed to determine whether it is a reasonable blocker for this proposal).

The Matrix.org Foundation would also be engaged in helping users move over to .well-known through the
normal channels (blog posts, changelog on Synapse, social media, etc).

## Potential issues

As identified, this change could impact legitimate usage of SRV records for discovery - this proposal
exists to give readers, homeserver authors, etc time to identify these cases before the proposal is put
up for final comment period (thus proposing it for acceptance). If legitimate cases of SRV records are
found, this proposal may be declined or rejected (per normal process).
Member

There are a couple nice things about SRV that we don't get (at least currently) from the .well-known method.

The first, and one that has always made me fret a bit about .well-known, is that you're relying on your front door web server to be available. DNS has quite a lot of infrastructure (and institutional knowledge) to make it available in the event of outages. A lot of work has been done to make Synapse behave OK in the event of .well-known failures by using caching, but it is convoluted to get right (cf. the sheer number of constants involved). Whereas for DNS we get a lot of this for free (via caching resolvers).

Broadly though: it makes me a bit nervous to tie availability of federation to the availability of your front door web server (which is a natural target for e.g. DDoS and the like). Especially for smaller deployments, where it generally doesn't take much for their website to go down.

Secondly, SRV records have inbuilt support for returning multiple servers with different priorities and weights. This is not currently used by anyone (AFAIK), but may prove to be very helpful in the event we get a HA Matrix server. We see these options used heavily in SMTP land (via MX records), where you have your primary set of SMTP servers and then your backup set of servers in cases of severe outages of your primary set. This can be added to .well-known, but again you run the risk of re-implementing SRV records.

I do have sympathy for the argument that we shouldn't overly worry about this as it's just not used currently, but personally I think we need to maintain half an eye on ensuring that it would be possible to implement HA in federation sanely.


Most of the above arguments were made when we introduced .well-known, and we generally considered that delegation was more important than those features. I do think it's worth explicitly calling out the above as it means that server admins would no longer be able to take advantage of those features if they chose to.
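
To make the priority and weight behaviour mentioned in the comment above more concrete, here is a rough sketch of how a client might order SRV targets: lower priorities first, with a weighted random order inside each priority group. It is an illustration only. `SrvRecord` and `order_srv_targets` are hypothetical names, the hostnames are made up, and the handling of zero weights is simplified compared to RFC 2782.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower values are tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv_targets(records: List[SrvRecord],
                      rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client would attempt them."""
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                # Simplification: all-zero weights are ordered uniformly at random.
                rng.shuffle(group)
                ordered.extend(group)
                break
            # Pick one record with probability proportional to its weight.
            pick = rng.choices(group, weights=weights, k=1)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered


records = [
    SrvRecord(20, 0, 8448, "backup.example.org"),
    SrvRecord(10, 60, 8448, "a.example.org"),
    SrvRecord(10, 40, 8448, "b.example.org"),
]
attempt_order = order_srv_targets(records)
```

In this example the two priority-10 servers are always tried before `backup.example.org`, with `a.example.org` coming first roughly 60% of the time. Reproducing this in .well-known would mean defining equivalent fields there.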


## Alternatives

This may be a good time to design new discovery mechanisms, however that would have an even larger
impact on the ecosystem. Additionally, .well-known appears to be the (current) industry standard
for this mechanism.

## Security considerations

Removing SRV discovery could mean a higher rate of homeservers being delegated to third party providers
or being targets of takeover attempts. However, given that Synapse (the most widely deployed homeserver
implementation) already strongly recommends .well-known over SRV, this issue is considered trivial in nature.

## Unstable prefix

No unstable prefix is possible for this proposal. Instead, a migration period is explicitly proposed
as an alternative.

## Dependencies

None relevant.