Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we unescape characters in path? #606

Open
TimothyGu opened this issue May 21, 2021 · 19 comments
Open

Should we unescape characters in path? #606

TimothyGu opened this issue May 21, 2021 · 19 comments
Labels
clarification Standard could be clearer

Comments

@TimothyGu
Copy link
Member

Consider

<a href="https://jsdom.github.io/whatwg-url/">normal</a>,
<a href="https://jsdom.github.io/wh%61twg-url/">encoded 'a'</a>

In Chrome, both a elements have pathname "/whatwg-url/", indicating that the %61 was unescaped.

In Safari, the second a element has pathname "/wh%61twg-url/", but when navigating, /whatwg-url/ is actually the destination.

In Firefox, the second a element has pathname "/wh%61twg-url/", and that pathname is used when navigating (resulting in a 404 error with GitHub pages). However, confusingly, the address bar in the browser chrome has the valid "whatwg-url", so if you go to the address bar and press Enter it works:

image

The spec currently doesn't do any unescaping. Should we?

@karwa
Copy link
Contributor

karwa commented May 21, 2021

It sounds like a good idea to decode them, IMO. The latest HTTP semantics draft spec says:

Scheme-based normalization (Section 6.2.3 of [RFC3986]) of "http" and "https" URIs involves the following additional rules:
...
Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [RFC3986]).

(I believe this would also apply to at least the query)

We already do the other HTTP-specific normalisations (removing default ports, root path instead of empty, lowercased host name), as well as other normalisations (e.g. exotic IP addresses), so I think it makes sense to do this, too. Some part of the system will have to - best to do it as soon as possible at the URL level to avoid mismatches like those you’ve described.

@alwinb
Copy link
Contributor

alwinb commented Jul 7, 2021

Related issues: #369, #565 and #87. Quick questions, probably discussed there:

  • Which set of characters (code points?) would then be unescaped (aka. the unreserved set)
  • What then to do with invalid sequences becoming valid sequences after unescaping, such as e.g. http://example/%1%61

I think #87 gives the reasons for not doing this.

It is still important to define the semantics of escape sequences, for server-side URL handling and for interoperability though. Currently the standard does not discuss that.

@philloooo
Copy link

Filed a bug in Chromium to track this. https://crbug.com/1252531

@annevk
Copy link
Member

annevk commented Sep 24, 2021

FWIW, I'm not sure I agree with the notion that we need to define the semantics. Or at least in such a way that there is only one valid interpretation. Servers can interpret URLs however they wish and they do not necessarily need to agree with each other on that. If one server considers a and A the same and another does not, that's perfectly acceptable.

@alwinb
Copy link
Contributor

alwinb commented Sep 25, 2021

What about the semantics of percent encoded sequences though? Not the additional protocol- or application specific normalisations, but the analogue of Percent-Encoding Normalization in the RFC.

I don't think the URL Standard currently states that %61 may be considered equivalent to a, other than in the domain (where it is obligatory). It might even make sense to mandate that, so that it is safe to assume that e.g. http://example/%61 and http://example/a refer to the same resource, whether browsers normalise them to the same URL or not.

If we do make the semantics explicit then the question arises of how to correct for invalid escape sequences. So that %6%31 does not end up being (percent-encoding-) equivalent with %61 and then transitively with a.

@karwa
Copy link
Contributor

karwa commented Sep 25, 2021

We already define that %2E and %2e are equivalent to . in path components (even a mixed component like .%2E is treated equivalently to ..).

We also add percent-encoding for certain characters - both in the parser and via the component setters. It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.

Basically, if we're not happy with saying that %61 must always be the same as a (and that is the only interpretation), then the following operation:

var url = new URL("http://example.com/foo/bar");

url.href; // "http://example.com/foo/bar"

url.pathname = "what should this do???";

url.href; // "http://example.com/what%20should%20this%20do%3F%3F%3F"

Should also fail. Otherwise, the string what%20should%20this%20do%3F%3F%3F is not necessarily the same as what should this do???.

I don't think that's workable. At least for ASCII code-points, they must be equivalent.

@annevk
Copy link
Member

annevk commented Sep 27, 2021

It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.

Maybe? That really depends on whether the user knows what the parser will do.

I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work, but I'm not sure why we'd mandate things we cannot really require. If a server wants to treat %61 and a differently, it can.

@valenting
Copy link
Collaborator

I agree with Anne. Ideally we'd do as little processing as possible on a URL, and let the server handle them as well as they can.
There are corner cases beyond percent encoding. For example http://example.com/path/to//file (two slashes) and http://example.com/path/to/file (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.

@alwinb
Copy link
Contributor

alwinb commented Sep 29, 2021

I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work

A good way to go about that is to point out in the standard that there are multiple normalisations/ equivalence relations on URLs that are in common use, and add a section, maybe? to explain that. I suppose, in the WHATWG style it would contain algorithms that compute them. (With a statement that they are not normative for browsers, where this is the case).

Percent encoding normalisation / equivalence is a good one to start with.

Collapsing slashes could be another one to mention. IIRC this is relevant also to another open issue about JS module specifiers.

@annevk
Copy link
Member

annevk commented Sep 29, 2021

I suppose we could add something to https://url.spec.whatwg.org/#url-equivalence that is very clearly scoped to non-browser contexts. There's also query parameter order and such (to the extent we want to acknowledge query parameters there, not sure).

@guybedford
Copy link

Treating a feature like this as part of equivalence instead of canonical serialization would mean it definitely wouldn't apply to JS module system instance identification. For example, all JS module systems compare the href not based on the equivalence algorithm and for this reason fragments result in separate module instances already, and users have already adopted this as a feature. Not sure which way I stand on that but just noting that these decisions should always be considered with reference to their modules implications at this point please.

@annevk
Copy link
Member

annevk commented Oct 8, 2021

@guybedford yeah, the same is true for many other systems out there. And to be clear, it wouldn't change the default equivalence algorithm, it would just be an option if you want more URLs to be the same (that also can reasonably be argued to the be same).

@malaire
Copy link

malaire commented Oct 15, 2021

The issue #565 which was just closed as duplicate has some additional discussion which might be worth checking out by those following this issue. But I agree on keeping any further discussion here.

@alwinb
Copy link
Contributor

alwinb commented Oct 24, 2021

Also reading #369 again, it has an especially good description. That issue also explains the address bar behaviour of Firefox.

My take is that we have to make a decision about how to handle invalid escape sequences. Not knowing/ deciding on how to handle those leads to the current situation by default rather than by making a deliberate decision. And if it ends up not being used by browsers, then it is still valuable to specify a recommended behaviour for other applications.

To move on with #606 (comment), I think it is needed. And I think we have to revisit the reserved characters. Unless we’d go for a comparison on URL records with fully decoded components. But I’m not sure that is desirable, especially in the query string.

By the way, I think adding something there is really great!

@Jikstra
Copy link

Jikstra commented Dec 28, 2021

Any news on this? The regression around actix/actix-web#2398 is sadly holding us back updating actix-web to the latest beta. svenstaro/miniserve#677

@karwa
Copy link
Contributor

karwa commented Jan 23, 2022

I'm currently exploring implementing this in Swift, as over-encoding/removing over-encoding is an important feature for interop with our existing RFC-2396 URL type, as well as a generally useful feature. Having looked a the previous issues, I'm reasonably convinced this is possible. I'm not seeing any insurmountable challenges.

Maybe? That really depends on whether the user knows what the parser will do.

I don't really find this very satisfying; the same argument could be made the other way. If the user is expected to have a deep and detailed understanding of the parser, any behaviour is reasonable and nothing needs to be justified. It's a kind of cyclical reasoning where things happen because they happen.

If a server wants to treat %61 and a differently, it can.

On the one hand, this is is demonstrably true because - well, form-encoding 😔. A + and a %2B may certainly be different depending on how the query is interpreted.

On the other hand, at least for some characters in some components, that behaviour would not appear to be web compatible. Routers, caches and CDNs will sometimes decode these bytes, and expect that they do not change the meaning of the URL. The discussions in previous issues seems to indicate that many browsers very much do expect these to be equivalent. This leads to the idea that we need some kind of "unreserved set" (perhaps per-component).

Such a server would serve different resources to different browsers for the same URL, which seems at-odds with the idea of interoperability or the web as a platform. The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server. If GHP is indeed entitled to behave that way, it suggests that all browsers which successfully navigate to that URL are wrong - which again, does not seem to be a web-compatible position.

There are corner cases beyond percent encoding. For example http://example.com/path/to//file (two slashes) and http://example.com/path/to/file (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.

The difference, IMO, is that the URL parser does not add or remove empty path components (any more! It used to do that to file URLs). It does, however, add and remove percent-encoding, meaning there is already implicit acceptance that doing so does not change the meaning of the URL.

By definition, if the parser does something (e.g. turning http://ex%61mple.com -> http://example.com), it must preserve meaning, as any attempt to utter the former as a URL record results in the latter, and URLs are records:

A URL is a struct that represents a universal identifier. To disambiguate from a valid URL string it can also be referred to as a URL record.

We are forced to accept that the web's model of URLs, as defined by the various browser implementations over the decades, includes this assumption that percent-encoding may be safely added or removed in certain circumstances, and that a standard which attempts to describe that model must define that process and the circumstances where it applies.

@annevk annevk added clarification Standard could be clearer and removed topic: parser labels Dec 16, 2022
@annevk
Copy link
Member

annevk commented Dec 16, 2022

The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server.

I don't think it does? Browsers that do not behave like Firefox (and Safari behaves like Firefox so something changed since OP) would not be able to visit the 404 at https://jsdom.github.io/wh%61twg-url/ as they would instead get the resource at https://jsdom.github.io/whatwg-url/ which is different.

I'm going to treat this as a clarification issue as per #606 (comment). PRs welcome.

@gibson042
Copy link

@annevk By "treat this as a clarification issue", you mean continuing to preserve percent-encoding of characters outside the RFC 3986 reserved set in https://url.spec.whatwg.org/#path-state ? That is permitted by the current HTTP semantics document (which does not require normalization but does make clear that such gratuitous encoding maintains the interpretation of a URI, i.e. it still identifies the same resource), although it does put more burden on user code that is trying to be robust. How would you feel about an issue requesting a normalize method?

@annevk
Copy link
Member

annevk commented Dec 16, 2022

Yeah. A new API seems fair (though see also https://whatwg.org/faq#adding-new-features and https://whatwg.org/working-mode#changes). Deciding on the semantics might be tricky, but hopefully we can figure something out. (Would be nice if that also tackled query params and such.)

aarongable pushed a commit to chromium/chromium that referenced this issue Sep 4, 2023
This CL is part of the URL interop 2023 effort. "Intent to Implement
and Ship" is [1].

Currently, when Chrome parses a URL, it decodes percent-encoded ASCII
characters in URL path. However, this behavior doesn't align with the
URL Standard [2]. The CL fixes this behavior to retain percent-encoded
ASCII characters in URL's path.

Before:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/A"

After:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/%41"

Interoperability:

- Chrome isn't compliant, while Firefox and Safari are compliant.
- I've tested URL APIs in non-browser environments and libraries,
  such as Deno's `URL` implementation [3] and Rust's `url` crate [4],
  both of which are standard-compliant.

Background:

The existing behavior seems to be a result of past decisions. The
comment in `url_canon_path.cc` states:

> // This table was used to be designed to match exactly what IE did
> // with the characters.

Impact:

Regarding implementation, web-exported URL APIs, GURL, and KURL share
the same URL parser and canonicalization backend. Given that these URL
classes are widely used both internally or externally, predicting all
possible consequences and risks is challenging.

Given the very low user metrics [5], we received approval to land [1],
but with a kill switch in place.

UMA:

Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023)

This number isn't specific to any particular use case and represents a
an upper bound. The actual impact is likely lower.

Interaction with web servers:

Before:

When a user enters "https://example.com/%41" in the address bar or
clicks a link like <a href="https://example.com/%41">, Chrome sends
"/A" to the server.

After::

Chrome now sends "/%41" to the server, without decoding, similar to
Safari and Firefox. Note that Chrome's address bar will still display
"https://example.com/A" because the address bar formats URLs in their
own way.

For websites, how to handle percent-encoded characters in a URL's path
is up to each website. Since they can receive such URLs from various
clients, not just Chrome, this isn't a new issue for most websites.
They typically decode a URL's path before processing.

Another concern relates to Chromium's internal code or developers who
rely on the current behavior, intentionally or not.

For example, this CL might lead to issues in cases like:

```
const hash = {};

const url1 = new URL("http://example.com/%41");
hash[url1.href] = "v1";
// ...
const url2 = new URL("http://example.com/A");
hash[url2.href]  // Assumed that "v1" is retrieved, but this is no longer true.
```

According to the URL Standard, `url1` and `url2` are not equivalent
[6], but some clients might depend on Chrome's current behavior as a
feature. This presents a risk.

Additional notes:

- This change only affects the URL's path. Other parts like the host
  are not impacted.
- There was a discussion about Chrome's behavior [7]. The consensus is
  that Chrome's behavior should be fixed for better interoperability.
- There's a proposal to add a normalization interface [8] to URL.

- [1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ
- [2] https://url.spec.whatwg.org/#url-parsing
- [3] https://deno.land/api@v1.34.3?s=URL
- [4] https://docs.rs/url/latest/url/
- [5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62
- [6] https://url.spec.whatwg.org/#url-equivalence
- [7] whatwg/url#606
- [8] whatwg/url#729

Bug: 1252531
Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: James Lee <ljjlee@google.com>
Reviewed-by: Avi Drissman <avi@chromium.org>
Reviewed-by: Emily Stark <estark@chromium.org>
Commit-Queue: Hayato Ito <hayato@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1191900}
ubonass pushed a commit to ubonass/google_url that referenced this issue Nov 10, 2024
This CL is part of the URL interop 2023 effort. "Intent to Implement
and Ship" is [1].

Currently, when Chrome parses a URL, it decodes percent-encoded ASCII
characters in URL path. However, this behavior doesn't align with the
URL Standard [2]. The CL fixes this behavior to retain percent-encoded
ASCII characters in URL's path.

Before:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/A"

After:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/%41"

Interoperability:

- Chrome isn't compliant, while Firefox and Safari are compliant.
- I've tested URL APIs in non-browser environments and libraries,
  such as Deno's `URL` implementation [3] and Rust's `url` crate [4],
  both of which are standard-compliant.

Background:

The existing behavior seems to be a result of past decisions. The
comment in `url_canon_path.cc` states:

> // This table was used to be designed to match exactly what IE did
> // with the characters.

Impact:

Regarding implementation, web-exported URL APIs, GURL, and KURL share
the same URL parser and canonicalization backend. Given that these URL
classes are widely used both internally or externally, predicting all
possible consequences and risks is challenging.

Given the very low user metrics [5], we received approval to land [1],
but with a kill switch in place.

UMA:

Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023)

This number isn't specific to any particular use case and represents a
an upper bound. The actual impact is likely lower.

Interaction with web servers:

Before:

When a user enters "https://example.com/%41" in the address bar or
clicks a link like <a href="https://example.com/%41">, Chrome sends
"/A" to the server.

After::

Chrome now sends "/%41" to the server, without decoding, similar to
Safari and Firefox. Note that Chrome's address bar will still display
"https://example.com/A" because the address bar formats URLs in their
own way.

For websites, how to handle percent-encoded characters in a URL's path
is up to each website. Since they can receive such URLs from various
clients, not just Chrome, this isn't a new issue for most websites.
They typically decode a URL's path before processing.

Another concern relates to Chromium's internal code or developers who
rely on the current behavior, intentionally or not.

For example, this CL might lead to issues in cases like:

```
const hash = {};

const url1 = new URL("http://example.com/%41");
hash[url1.href] = "v1";
// ...
const url2 = new URL("http://example.com/A");
hash[url2.href]  // Assumed that "v1" is retrieved, but this is no longer true.
```

According to the URL Standard, `url1` and `url2` are not equivalent
[6], but some clients might depend on Chrome's current behavior as a
feature. This presents a risk.

Additional notes:

- This change only affects the URL's path. Other parts like the host
  are not impacted.
- There was a discussion about Chrome's behavior [7]. The consensus is
  that Chrome's behavior should be fixed for better interoperability.
- There's a proposal to add a normalization interface [8] to URL.

- [1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ
- [2] https://url.spec.whatwg.org/#url-parsing
- [3] https://deno.land/api@v1.34.3?s=URL
- [4] https://docs.rs/url/latest/url/
- [5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62
- [6] https://url.spec.whatwg.org/#url-equivalence
- [7] whatwg/url#606
- [8] whatwg/url#729

Bug: 1252531
Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: James Lee <ljjlee@google.com>
Reviewed-by: Avi Drissman <avi@chromium.org>
Reviewed-by: Emily Stark <estark@chromium.org>
Commit-Queue: Hayato Ito <hayato@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1191900}
NOKEYCHECK=True
GitOrigin-RevId: 21947c4c384cd129c20862475364ec6d430998ea
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Standard could be clearer
Development

No branches or pull requests

10 participants