Should we unescape characters in path? #606
It sounds like a good idea to decode them, IMO. The latest HTTP semantics draft says that characters outside the reserved set are equivalent to their percent-encoded octets, and that the normal form is to leave them unencoded. (I believe this would also apply to at least the query.)

We already do the other HTTP-specific normalisations (removing default ports, root path instead of empty, lowercased host name), as well as other normalisations (e.g. exotic IP addresses), so I think it makes sense to do this, too. Some part of the system will have to; best to do it as soon as possible, at the URL level, to avoid mismatches like those you've described.
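For concreteness (this illustration is not from the original comment; the inputs are made up), these are the kinds of normalisation the parser already performs:

```
// Normalisations the URL parser already applies, per the URL Standard:
new URL("HTTP://EXAMPLE.com:80").href; // "http://example.com/"
                                       // lowercased scheme and host, default port
                                       // removed, empty path serialised as "/"
new URL("http://0x7f.1/").href;        // "http://127.0.0.1/"
                                       // "exotic" IPv4 form normalised to dotted decimal
```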
Related issues: #369, #565 and #87. Quick questions, probably discussed there:
I think #87 gives the reasons for not doing this. It is still important to define the semantics of escape sequences for server-side URL handling and for interoperability, though. Currently the standard does not discuss that.
Filed a bug in Chromium to track this: https://crbug.com/1252531
FWIW, I'm not sure I agree with the notion that we need to define the semantics. Or at least in such a way that there is only one valid interpretation. Servers can interpret URLs however they wish and they do not necessarily need to agree with each other on that. If one server considers an escaped character and its unescaped form to be the same resource and another does not, that is their prerogative.
What about the semantics of percent-encoded sequences, though? Not the additional protocol- or application-specific normalisations, but the analogue of Percent-Encoding Normalization in the RFC. I don't think the URL Standard currently states that equivalence anywhere. If we do make the semantics explicit, then the question arises of how to correct for invalid escape sequences.
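To illustrate the two cases being discussed (a sketch added for clarity, assuming a spec-compliant parser; the inputs are made up):

```
// Valid escape sequences are preserved by the parser; whether "%41" and "A" identify
// the same resource is exactly the semantic question raised above.
new URL("http://example.com/%41").pathname; // "/%41" (not decoded to "/A")

// Invalid escape sequences ("%" not followed by two hex digits) are also left as-is;
// the parser only records a validation error.
new URL("http://example.com/%zz").pathname; // "/%zz"

// Note: the spec's percent-decode leaves invalid sequences untouched, while
// JavaScript's decodeURIComponent throws on them — one more reason their
// semantics need spelling out.
decodeURIComponent("%zz"); // throws URIError
```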
We already define that escape sequences can affect how the path is parsed (for instance, "%2e" is treated the same as "." when determining single- and double-dot path segments). We also add percent-encoding for certain characters, both in the parser and via the component setters. It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent; otherwise we would have produced a URL which points to a different resource than the user intended.

Basically, if we're not happy with saying that the encoded and unencoded forms are equivalent, then this:

```
var url = new URL("http://example.com/foo/bar");
url.href; // "http://example.com/foo/bar"

url.pathname = "what should this do???";
url.href; // "http://example.com/what%20should%20this%20do%3F%3F%3F"
```

should also fail. Otherwise, the string the user assigned would end up identifying a different resource than they intended. I don't think that's workable. At least for ASCII code-points, they must be equivalent.
Maybe? That really depends on whether the user knows what the parser will do. I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work, but I'm not sure why we'd mandate things we cannot really require. If a server wants to treat an escaped character differently from its unescaped form, we cannot really stop it.
I agree with Anne. Ideally we'd do as little processing as possible on a URL, and let the server handle it as well as it can.
A good way to go about that is to point out in the standard that there are multiple normalisations/equivalence relations on URLs in common use, and perhaps add a section to explain that. I suppose, in the WHATWG style, it would contain algorithms that compute them (with a statement that they are not normative for browsers, where that is the case). Percent-encoding normalisation/equivalence is a good one to start with. Collapsing slashes could be another one to mention. IIRC this is relevant also to another open issue about JS module specifiers.
I suppose we could add something to https://url.spec.whatwg.org/#url-equivalence that is very clearly scoped to non-browser contexts. There's also query parameter order and such (to the extent we want to acknowledge query parameters there, not sure).
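A rough sketch of the kind of opt-in, non-default equivalence being discussed (the helper name and exact rules below are assumptions for illustration, not a proposed spec algorithm; it only decodes "unreserved" ASCII in the path, which is RFC 3986's Percent-Encoding Normalization, and so deliberately leaves reserved characters such as %2F alone):

```
// Hypothetical helper: two URLs are considered equivalent if they differ only in the
// percent-encoding of unreserved ASCII characters (letters, digits, "-", ".", "_", "~")
// in the path, plus hex-case of remaining escapes. Query and fragment are compared verbatim.
function equivalentIgnoringUnreservedEscapes(a, b) {
  const decodeUnreserved = (s) =>
    s.replace(/%([0-9A-Fa-f]{2})/g, (m, hex) => {
      const ch = String.fromCharCode(parseInt(hex, 16));
      return /[A-Za-z0-9\-._~]/.test(ch) ? ch : m.toUpperCase();
    });
  const u1 = new URL(a), u2 = new URL(b);
  return u1.origin === u2.origin &&
         decodeUnreserved(u1.pathname) === decodeUnreserved(u2.pathname) &&
         u1.search === u2.search &&
         u1.hash === u2.hash;
}

equivalentIgnoringUnreservedEscapes("http://example.com/%41", "http://example.com/A");    // true
equivalentIgnoringUnreservedEscapes("http://example.com/a%2Fb", "http://example.com/a/b"); // false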
Treating a feature like this as part of equivalence instead of canonical serialization would mean it definitely wouldn't apply to JS module system instance identification. For example, all JS module systems compare the serialized URL string (the href) when deciding whether two specifiers refer to the same module instance.
@guybedford yeah, the same is true for many other systems out there. And to be clear, it wouldn't change the default equivalence algorithm; it would just be an option if you want more URLs to be the same (that can also reasonably be argued to be the same).
The issue #565, which was just closed as a duplicate, has some additional discussion that might be worth checking out by those following this issue. But I agree on keeping any further discussion here.
Also, reading #369 again, it has an especially good description. That issue also explains the address bar behaviour of Firefox.

My take is that we have to make a decision about how to handle invalid escape sequences. Not knowing/deciding how to handle those leads to the current situation by default rather than by a deliberate decision. And if it ends up not being used by browsers, it is still valuable to specify a recommended behaviour for other applications. To move on with #606 (comment), I think it is needed. And I think we have to revisit the reserved characters, unless we'd go for a comparison on URL records with fully decoded components. But I'm not sure that is desirable, especially in the query string.

By the way, I think adding something there is really great!
Any news on this? The regression around actix/actix-web#2398 is sadly holding us back from updating actix-web to the latest beta. svenstaro/miniserve#677
I'm currently exploring implementing this in Swift, as over-encoding/removing over-encoding is an important feature for interop with our existing RFC 2396 URL type, as well as a generally useful feature. Having looked at the previous issues, I'm reasonably convinced this is possible. I'm not seeing any insurmountable challenges.
I don't really find this very satisfying; the same argument could be made the other way. If the user is expected to have a deep and detailed understanding of the parser, any behaviour is reasonable and nothing needs to be justified. It's a kind of cyclical reasoning where things happen because they happen.
On the one hand, this is demonstrably true because - well, form-encoding 😔.

On the other hand, at least for some characters in some components, that behaviour would not appear to be web compatible. Routers, caches and CDNs will sometimes decode these bytes and expect that doing so does not change the meaning of the URL. The discussions in previous issues seem to indicate that many browsers very much do expect these to be equivalent. This leads to the idea that we need some kind of "unreserved set" (perhaps per-component).

A server which treats the escaped and unescaped forms differently would serve different resources to different browsers for the same URL, which seems at odds with the idea of interoperability or the web as a platform. The evidence in this issue indicates that GitHub Pages is apparently behaving as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server. If GHP is indeed entitled to behave that way, it suggests that all browsers which successfully navigate to that URL are wrong - which again, does not seem to be a web-compatible position.
The difference, IMO, is that the URL parser does not add or remove empty path components (any more! It used to do that to file URLs). It does, however, add and remove percent-encoding, meaning there is already implicit acceptance that doing so does not change the meaning of the URL. By definition, if the parser does something (e.g. turning a space into "%20"), that operation must not change which resource the URL identifies.
We are forced to accept that the web's model of URLs, as defined by the various browser implementations over the decades, includes this assumption that percent-encoding may be safely added or removed in certain circumstances, and that a standard which attempts to describe that model must define that process and the circumstances where it applies.
I don't think it does? Browsers that do not behave like Firefox (and Safari behaves like Firefox, so something changed since OP) would not be able to visit the 404 at the /wh%61twg-url/ address.

I'm going to treat this as a clarification issue as per #606 (comment). PRs welcome.
@annevk By "treat this as a clarification issue", you mean continuing to preserve percent-encoding of characters outside the RFC 3986 reserved set in https://url.spec.whatwg.org/#path-state ? That is permitted by the current HTTP semantics document (which does not require normalization but does make clear that such gratuitous encoding maintains the interpretation of a URI, i.e. it still identifies the same resource), although it does put more burden on user code that is trying to be robust. How would you feel about an issue requesting a |
Yeah. A new API seems fair (though see also https://whatwg.org/faq#adding-new-features and https://whatwg.org/working-mode#changes). Deciding on the semantics might be tricky, but hopefully we can figure something out. (Would be nice if that also tackled query params and such.)
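Purely as a sketch of what such an API might do (the name "normalize" and the rules below are assumptions for illustration, not the proposal tracked in whatwg/url#729): decode escapes for unreserved ASCII in the path, and leave reserved characters and invalid sequences alone.

```
// Hypothetical normalization helper, not a standard or proposed API.
function normalize(url) {
  const u = new URL(url);
  u.pathname = u.pathname.replace(/%([0-9A-Fa-f]{2})/g, (m, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    // Decode only unreserved ASCII; keep reserved and non-ASCII bytes encoded.
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : m;
  });
  return u;
}

normalize("http://example.com/%41%2Fb").href; // "http://example.com/A%2Fb"
```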
This CL is part of the URL interop 2023 effort. "Intent to Implement and Ship" is [1].

Currently, when Chrome parses a URL, it decodes percent-encoded ASCII characters in the URL's path. However, this behavior doesn't align with the URL Standard [2]. The CL fixes this behavior to retain percent-encoded ASCII characters in a URL's path.

Before:

```
> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/A"
```

After:

```
> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/%41"
```

Interoperability:
- Chrome isn't compliant, while Firefox and Safari are compliant.
- I've tested URL APIs in non-browser environments and libraries, such as Deno's `URL` implementation [3] and Rust's `url` crate [4], both of which are standard-compliant.

Background: The existing behavior seems to be a result of past decisions. The comment in `url_canon_path.cc` states:

> // This table was used to be designed to match exactly what IE did
> // with the characters.

Impact: Regarding implementation, web-exported URL APIs, GURL, and KURL share the same URL parser and canonicalization backend. Given that these URL classes are widely used both internally and externally, predicting all possible consequences and risks is challenging. Given the very low user metrics [5], we received approval to land [1], but with a kill switch in place.

UMA: Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023). This number isn't specific to any particular use case and represents an upper bound. The actual impact is likely lower.

Interaction with web servers:
- Before: When a user enters "https://example.com/%41" in the address bar or clicks a link like `<a href="https://example.com/%41">`, Chrome sends "/A" to the server.
- After: Chrome now sends "/%41" to the server, without decoding, similar to Safari and Firefox. Note that Chrome's address bar will still display "https://example.com/A" because the address bar formats URLs in its own way.

For websites, how to handle percent-encoded characters in a URL's path is up to each website. Since they can receive such URLs from various clients, not just Chrome, this isn't a new issue for most websites. They typically decode a URL's path before processing.

Another concern relates to Chromium's internal code or developers who rely on the current behavior, intentionally or not. For example, this CL might lead to issues in cases like:

```
const hash = {};
const url1 = new URL("http://example.com/%41");
hash[url1.href] = "v1";
// ...
const url2 = new URL("http://example.com/A");
hash[url2.href]; // Assumed that "v1" is retrieved, but this is no longer true.
```

According to the URL Standard, `url1` and `url2` are not equivalent [6], but some clients might depend on Chrome's current behavior as a feature. This presents a risk.

Additional notes:
- This change only affects the URL's path. Other parts like the host are not impacted.
- There was a discussion about Chrome's behavior [7]. The consensus is that Chrome's behavior should be fixed for better interoperability.
- There's a proposal to add a normalization interface [8] to URL.

[1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ
[2] https://url.spec.whatwg.org/#url-parsing
[3] https://deno.land/api@v1.34.3?s=URL
[4] https://docs.rs/url/latest/url/
[5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62
[6] https://url.spec.whatwg.org/#url-equivalence
[7] whatwg/url#606
[8] whatwg/url#729

Bug: 1252531
Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: James Lee <ljjlee@google.com>
Reviewed-by: Avi Drissman <avi@chromium.org>
Reviewed-by: Emily Stark <estark@chromium.org>
Commit-Queue: Hayato Ito <hayato@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1191900}
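Not part of the CL: one way application code that relied on the old decoding behavior could be made robust to the change is to normalize its own keys instead of relying on the parser, for example as sketched below (the `keyFor` helper is hypothetical; note that `decodeURIComponent` also decodes reserved characters such as %2F, which may or may not be acceptable for a given application):

```
function keyFor(input) {
  const u = new URL(input);
  let path;
  try {
    path = decodeURIComponent(u.pathname); // "/%41" and "/A" both become "/A"
  } catch {
    path = u.pathname;                     // invalid escape sequences: keep the raw path
  }
  return u.origin + path + u.search + u.hash;
}

const cache = {};
cache[keyFor("http://example.com/%41")] = "v1";
cache[keyFor("http://example.com/A")];   // "v1" in both old and new Chrome
```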
Consider two `a` elements, the second of which has %61 in its href (i.e. a path of /wh%61twg-url/ instead of /whatwg-url/).

- In Chrome, both `a` elements have pathname "/whatwg-url/", indicating that the %61 was unescaped.
- In Safari, the second `a` element has pathname "/wh%61twg-url/", but when navigating, /whatwg-url/ is actually the destination.
- In Firefox, the second `a` element has pathname "/wh%61twg-url/", and that pathname is used when navigating (resulting in a 404 error with GitHub Pages). However, confusingly, the address bar in the browser chrome shows the valid "whatwg-url", so if you go to the address bar and press Enter it works.

The spec currently doesn't do any unescaping. Should we?
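A rough reproduction of the setup described above (the markup, ids and host below are assumptions for illustration; only the two paths come from the report):

```
// Assumed markup:
//   <a id="plain"   href="https://some-site.github.io/whatwg-url/">plain</a>
//   <a id="escaped" href="https://some-site.github.io/wh%61twg-url/">escaped</a>

document.getElementById("plain").pathname;   // "/whatwg-url/" in every browser
document.getElementById("escaped").pathname; // per the spec (and Firefox/Safari): "/wh%61twg-url/"
                                             // Chrome, before the fix above: "/whatwg-url/"
```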