Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

underscores should be stripped from hostnames generated for apt config #3628

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 13 comments
Labels
launchpad Migrated from Launchpad priority Fix soon

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1868232

Launchpad details
affected_projects = ['cloud-init (Ubuntu)', 'cloud-init (Ubuntu Focal)']
assignee = oddbloke
assignee_name = Dan Watkins
date_closed = 2020-05-09T03:02:07.675034+00:00
date_created = 2020-03-20T09:04:54.664199+00:00
date_fix_committed = 2020-04-15T21:55:49.862139+00:00
date_fix_released = 2020-05-09T03:02:07.675034+00:00
id = 1868232
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1868232
milestone = None
owner = mthaddon
owner_name = Tom Haddon
private = False
status = fix_released
submitter = mthaddon
submitter_name = Tom Haddon
tags = []
duplicates = []

Launchpad user Tom Haddon(mthaddon) wrote on 2020-03-20T09:04:54.664199+00:00

In a ticket filed in the Ubuntu RT instance we were made aware of an issue where if a cloud is configured with an “” in the region name, cloud-init will generate an apt configuration that also includes that “” in the name.

So for example if the region name is zone_01, apt will be configured to use zone_01.clouds.archive.ubuntu.com.

On Friday March 13th we deployed some new archive servers on 18.04 using Apache 2.4.29-1ubuntu4.13. This version of apache has more strict protocol options than previous versions, per https://httpd.apache.org/docs/2.4/mod/core.html#httpprotocoloptions and the result is that a request to zone_01.clouds.archive.ubuntu.com returns a 400 Bad Request.

Could cloud-init be updated to remove non-permitted characters including “_” per https://tools.ietf.org/html/rfc3986#section-3.2.2 ?

@ubuntu-server-builder ubuntu-server-builder added launchpad Migrated from Launchpad priority Fix soon labels May 12, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2020-03-20T14:33:28.768520+00:00

The current behaviour is that cloud-init will use the patterns defined in /etc/cloud/cloud.cfg:

primary:

  • http://%(ec2_region)s.ec2.archive.ubuntu.com/ubuntu/
  • http://%(availability_zone)s.clouds.archive.ubuntu.com/ubuntu/
  • http://%(region)s.clouds.archive.ubuntu.com/ubuntu/

to determine the mirror to use. In this case, it will be one of the two latter patterns, depending on exactly how the data source in question presents "zone_01". Either way, the problem is the same.

Once this mirror URL is generated, cloud-init tests that it resolves before using it. This is where the problem lies: *.clouds.archive.ubuntu.com will always resolve, but the newly-deployed Apache servers will no longer serve every domain that resolves. Arguably this is a misconfiguration of the archive servers (why resolve something that you can't serve?), but cloud-init should handle this case gracefully regardless.

There are (at least) a couple of ways in which we could address this issue in cloud-init:

(a) rewrite the generated URL (or the variables which we are substituting into the pattern) to only include valid URI characters
(b) modify cloud-init to check that mirrors are accessible via HTTP (rather than simply resolvable)

While both of these would address the immediate issue, only implementing (b) would mean that all instances in such zones would fallback to using archive.ubuntu.com, so I think we should do some form of (a) regardless.

One obvious downside to (b) is that it will introduce an additional HTTP request to each boot on a Debian/Ubuntu host; this could be a concern both from a client boot speed perspective, but perhaps more importantly from a server load perspective. (My gut feel is that the cost in both cases wouldn't be significantly noticeable: most Debian/Ubuntu instances that come up will perform many HTTP requests to the archive hosts, so one additional one isn't likely to be noticed. We should consider this more deeply before we implement this, however.)

(As an aside, we should do some research to confirm that the non-ASCII encoding described in the linked RFC 3986 section won't be affected by our filtering. For example, if we currently rely on the libraries we use to convert non-ASCII hostnames to the defined percent-encoding, then we would regress non-ASCII hostnames by applying a naive filter before we pass the name to those libraries.)

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2020-03-20T15:51:32.599784+00:00

I've just done some research (by stepping through with a debugger), and socket.getaddrinfo does perform the encoding of non-ASCII characters:

In [7]: socket.getaddrinfo('www.\u2603.com', None)[0][4][0]
Out[7]: '185.53.178.7'

It does so using the 'idna' encoding:

In [2]: "www.☃.com".encode('idna')
Out[2]: b'www.xn--n3h.com'

which (unsurprisingly, given this bug) doesn't do anything to underscores:

In [4]: "www_foo.☃.com".encode('idna')
Out[4]: b'www_foo.xn--n3h.com'

So I believe the correct implementation of (a) would be to encode the URL ourselves, and then drop any invalid characters out. (We should check if there is any stdlib/requests functionality that already does this.)

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2020-03-23T16:15:04.922236+00:00

The cloud-init team chatted just now and we plan to implement (a), and reach out to the security and archive teams to confirm that they are comfortable with this path forward. I'll reiterate that proposal here for ease of review by those teams:

The Background:

To enable the provisioning of cloud-specific mirrors via only DNS changes, cloud-init configures apt with a .clouds.archive.ubuntu.com archive URL by default when launching Ubuntu instances. The specific "" name comes from the cloud (in slightly different ways depending on the cloud in question, but they all boil down to region/availability zone names). There is a *.clouds.archive.ubuntu.com DNS rule which resolves all of these archive URLs to the archive.ubuntu.com servers (in the absence of specific DNS rules for a region/cloud).

The Problem:

When instances are launched in regions/availability zones containing characters that are invalid in URIs (e.g. zone_01), cloud-init will currently generate and configure apt host names which are invalid URIs (e.g. zone_01.clouds.archive.ubuntu.com). These invalid URIs are now rejected by the archive.ubuntu.com web hosts (which were recently upgraded to bionic and its Apache), which means that new instances launched in these zones are configured with non-functional mirrors (the configured mirrors will return 400 for all requests).

The Proposed Solution:

cloud-init will remove invalid URI characters from the generated hostnames before configuring apt with them. (For example, zone_01.clouds.archive.ubuntu.com would be rewritten to zone01.clouds.archive.ubuntu.com.) Special care will be taken to ensure that IDNA names are not regressed.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Seth Arnold(seth-arnold) wrote on 2020-03-24T02:58:49.996545+00:00

It would be nice to address the wildcard DNS entries, as those have a potential for abuse, and can be endlessly confusing if you're not prepared for them.

In the meantime though, this plan sounds good to me.

I'm worried about collisions, where multiple providers may use us_west_2, us%west2, uswest~2, etc.

Some phrases read differently if the spacing is removed. The usual examples are powergen_italia and experts_exchange, but perhaps there's more realistic phrases for region names. (This seems quite small problem compared to the overall wildcard DNS entries, though.)

Reversible transformations are usually better but since we're presumably doing this with business partners, the trouble cases may fall under "don't do that" kinds of categories.

Are these actual problems? Probably it's fine but I thought I'd mention them just in case someone else with more context or creativity can make more of them.

Thanks

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Junien Fridrick(axino) wrote on 2020-03-24T08:02:28.839221+00:00

I do wonder about the actual risks of using "HttpProtocolOptions Unsafe" on the archive servers, which are only serving public, static files.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2020-03-24T14:10:45+00:00

On Tue, Mar 24, 2020 at 02:58:49AM -0000, Seth Arnold wrote:

It would be nice to address the wildcard DNS entries, as those have a
potential for abuse, and can be endlessly confusing if you're not
prepared for them.

The wildcard DNS entries are required for the system as-designed to
work, I believe. cloud-init will configure region-based mirror names
based only on the metadata available to it in the instance (so if
RandomCloud set up a eu-random-1 region, cloud-init would configure
eu-random-1.clouds.archive.ubuntu.com as the mirror for instances in
that new region). So we need some guarantee that all possible
region-named mirrors will be listened on, hence the wildcard DNS
entries.

(We generate per-region mirrors for, I believe, a couple of reasons.
Firstly, it allows us, or the cloud, to spin up in-region mirrors and
have them used by ~all already-deployed Ubuntu instances just by adding
non-wildcard DNS entries pointing at the new mirrors. And, secondly, it
means that clouds don't have an incentive to DNS-hijack
archive.ubuntu.com if they decide they want to host in-cloud mirrors, so
cloud users will have an easy way around the in-cloud mirrors if they
so desire.)

In the meantime though, this plan sounds good to me.

OK, good!

I'm worried about collisions, where multiple providers may use
us_west_2, us%west2, uswest~2, etc.

To some extent, we do have this problem to solve already, as clouds
could have regions named identically. That said, this certainly does
increase the chance of collision.

Some phrases read differently if the spacing is removed. The usual
examples are powergen_italia and experts_exchange, but perhaps there's
more realistic phrases for region names. (This seems quite small problem
compared to the overall wildcard DNS entries, though.)

Reversible transformations are usually better but since we're presumably
doing this with business partners, the trouble cases may fall under
"don't do that" kinds of categories.

It wouldn't be reversible, but we could convert invalid characters into,
say, "--" which is relatively unlikely to be used in real region
names/URIs. That would at least mean that "useast_1" and "useast1"
wouldn't collapse to the same mirror hostname (although it wouldn't do
anything about useast^1 and useast_1 colliding).

(I wonder if there are URI/hostname length boundaries that we would risk
running into if we replace single characters with anything more than a
single character, though.)

Are these actual problems? Probably it's fine but I thought I'd mention
them just in case someone else with more context or creativity can make
more of them.

The input is certainly welcome!

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dimitri John Ledkov(xnox) wrote on 2020-03-25T11:02:37.036902+00:00

in punnycode -- is the encoding of -, thus i wouldn't want to use that.

In [36]: "-☃.com".encode('idna')
Out[36]: b'xn---
-gsx.com'

I wish that idna module would do something sensible, and allow invalid chars to be encoded into xn-* string as unicode codepoints, but nothing seems to do that in python. Maybe time for an RFC update?!

Maybe it is worth talking to the cloud admins about using idna-safe zone names.

Most clouds use '-' for region name separators. We don't need round trip safety, as one can always query the zone name from the cloud-init metadata.

Thus cloud init imho, should be replacing any invalid chars with '-' (slightly better than dropping them, as the string length remains the same)

In terms of applying "HttpProtocolOptions Unsafe" on the cloud-mirror side, we should do this to support all the already launched instances, and those that are getting launched with cloud-init versions that generate unsafe urls, because otherwise those instances do not receive any updates. (I hope i understand the option right, and support strategy)

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Dan Watkins(oddbloke) wrote on 2020-03-25T21:00:12.385015+00:00

in punnycode -- is the encoding of -, thus i wouldn't want to use that.

In [36]: "-☃.com".encode('idna')
Out[36]: b'xn---
-gsx.com'

This isn't the case; "xn--" is the prefix used to indicate a domain name label is encoded with Punycode. I don't believe Punycode modifies ASCII characters in its input (though it of course does change their position (and order relative to non-ASCII characters)):

In [17]: "-\u3061".encode('idna') == "a\u3061".encode('idna').replace(b'a', b'-')
Out[17]: True

Regardless, I think increasing the length of the string is asking for problems, so I think "--" would a mistake.

Replacing them with "-" also poses a problem: "-" isn't a valid character at the start or end of domain name labels; rewriting "foo_.example.com" to "foo-.example.com" isn't any more valid (and, in fact, as "-"s are explicitly disallowed there, I bet it's actually worse in practical terms as naive implementations may reject the latter but not the former).

I think I'm going to go for "replace non-LDH characters with a single dash, then strip leading/trailing dashes". (It won't be too costly to modify, so if someone has a better idea please let me know.)

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2020-03-26T23:14:26.834061+00:00

+1 on this case. I worry about extending hostnames too as that typically has been a problem in the past at 63 chars disappear quickly.
In reviewing the RFC on hostnames[1] it makes sense to single minus/dash replace non-conforming characters and make sure first and last hostname chars are stripped if '-'.

The only alternative I could think of was replacing a leading or trailing '-' with an arbitrary 0 to avoid collisions with other hostnames that don't start or end with '-' if we strip characters.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Paul Collins(pjdc) wrote on 2020-04-15T04:45:19.170283+00:00

We've made some changes to the archive.ubuntu.com environment, and now requests to e.g. under_score.clouds.archive.ubuntu.com are directed to a dedicated vhost that has "HTTPProtocolOptions unsafe" enabled.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2020-05-09T03:02:10.584272+00:00

This bug is believed to be fixed in cloud-init in version 20.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Launchpad Janitor(janitor) wrote on 2020-05-13T22:38:19.088559+00:00

This bug was fixed in the package cloud-init - 20.2-20-gd10ce3ec-0ubuntu1


cloud-init (20.2-20-gd10ce3ec-0ubuntu1) groovy; urgency=medium

-- Chad Smith chad.smith@canonical.com Mon, 11 May 2020 20:17:06 -0600

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Seth Arnold(seth-arnold) wrote on 2020-05-20T23:50:28.112617+00:00

I believe this is fixed via c478d0b

I read through this patch and liked what I saw, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad priority Fix soon
Projects
None yet
Development

No branches or pull requests

1 participant