Percent-decode more stuff? #87

SimonSapin · 2016-02-10T12:15:22Z

In #77, I was surprised that . was such a special case. It is indeed in Firefox 44, but in Chromium 48 other characters are also percent-decoded: -, _, ~, and ASCII alpha-numerics. (I don’t have other browsers to test right now.)

http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=3878

It may be worth testing other URL components too.

The text was updated successfully, but these errors were encountered:

annevk · 2016-02-10T12:55:22Z

I think it's a special case in most browsers from when I tested this last time around.

annevk · 2016-10-19T12:49:12Z

@achristensen07 what are your ideas here? I believe @valenting wanted to do more decoding in Gecko as well: https://bugzilla.mozilla.org/show_bug.cgi?id=1197123.

It's basically a choice between this document, Safari, and Firefox and Chrome and Edge. Arguably what Chrome and Edge do is somewhat nicer since there is more canonicalization, but it could lead to subtle issues too.

achristensen07 · 2016-10-20T02:31:34Z

I think I'm opposed to this change. alert(new URL("http://host/p%61th")); should show the %61, not 'a'. There are a lot of URL implementations based on old RFCs, and we should try to match their behavior as closely as possible. People sometimes want '%' in the path and often in the query, and we don't want people to have to put things like "%2525" in their URL for some user agents and not others.

annevk · 2016-10-20T13:15:36Z

@achristensen07 note that the old RFCs encourage normalizing these in the second paragraph of https://tools.ietf.org/html/rfc3986#section-2.3.

annevk · 2016-10-20T13:16:44Z

To be clear, not %25, but a lot of other ranges, including %61.

achristensen07 · 2016-10-20T16:47:32Z

Sure enough, Chrome and Edge percent decode more things in the path, but not the query. This can bee seen with alert(new URL("http://host/%45%25?%45%25")); We definitely shouldn't touch the query and all browsers seem to agree on this.
I disagree with this proposal and rfc3986 because I think a canonicalized URL should always be untouched when reparsed. With this proposal, "http://host/%%36%31" would be canonicalized to "http://host/%61" which when reparsed would become "http://host/a" which is bad. Right now Chrome percent-encodes the first '%' in "http://host/%%36%31" which is strange, and Edge throws an exception.
It is clear that something needs to change and that we need to unite behavior. Right now, it is impossible to send "GET /%2e HTTP/1.1" as the first line of an HTTP request and I think that also needs to change. I'm not sure what the solution should be, but percent-decoding more characters isn't it.

annevk · 2016-10-21T06:10:44Z

Paging @foolip to see if we can get Chrome's perspective on this.

annevk · 2016-10-21T06:11:18Z

@valenting I suggest that before changing Gecko here we try to bring this thread to a conclusion first.

achristensen07 · 2016-10-21T08:57:45Z

Is it common on the internet to see percent-encoded dots or other characters in the path? Why was section 2.3 added to rfc3986? Safari has not percent-decoded anything in the path, and I'm not aware of any compatibility issues, but it's possible I've missed something. A solution to this problem would be to just not percent-decode anything in the path, including %2e and %2E

gibson042 · 2016-10-22T06:12:13Z

Question: to which of the following should "http://host/%%36%31" be equivalent?

"http://host/%2561" (representing the non-decodable percent as a percent and recovering immediately)
"http://host/%25%25361" (representing the non-decodable percent as a percent and recovering after the character responsible for its non-decodability)
"http://host/%25%2536%2531" (giving up on percent decoding completely upon encountering a non-decodable percent)
nothing (it is invalid, as in the older RFCs)
other (e.g., "http://host/a") (please no)

achristensen07 · 2016-10-25T00:00:47Z

Safari and Firefox both leave "http://host/%%36%31" untouched when canonicalizing. They also agree on the behavior of URLs like "http://host/%2e" and I suggest we change the spec to match this. What is the rationale behind Chrome's percent-encoding only the first percent, and percent decoding some characters but not others? It leads to some confusing behavior.

valenting · 2016-10-25T00:24:37Z

They also agree on the behavior of URLs like "http://host/%2e" and I suggest we change the spec to match this.

We actually tried to change it in Firefox so that we decode %2e but it got backed out due to an implementation bug. :)

achristensen07 · 2016-10-25T00:28:36Z

I still think we should change the spec to not percent-decode anything in the path. Percent-decoding %2E makes it so there is only one thing that cannot be in the path. Percent-decoding more leads to strange reparsing behavior. Simpler is better, unless there's a good reason we need this.

https://bugs.webkit.org/show_bug.cgi?id=163929 Reviewed by Alexey Proskuryakov. LayoutTests/imported/w3c: * web-platform-tests/url/a-element-expected.txt: * web-platform-tests/url/a-element-xhtml-expected.txt: * web-platform-tests/url/url-constructor-expected.txt: Source/WebCore: Covered by updated API tests, which show that URLParser now matches URL::parse in these cases. Also covered by newly failing web platform tests, which were failing before URLParser was enabled. If whatwg/url#87 is resolved we can change behavior to match. * platform/URLParser.cpp: (WebCore::URLParser::isSingleDotPathSegment): (WebCore::URLParser::isDoubleDotPathSegment): (WebCore::URLParser::consumeSingleDotPathSegment): (WebCore::URLParser::consumeDoubleDotPathSegment): (WebCore::URLParser::parse): (WebCore::URLParser::isPercentEncodedDot): Deleted. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): LayoutTests: * fast/url/path-expected.txt: * fast/url/standard-url-expected.txt: * fetch/fetch-url-serialization-expected.txt: git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207795 268f45cc-cd09-0410-ab3c-d52691b4dbfc

https://bugs.webkit.org/show_bug.cgi?id=163929 Reviewed by Alexey Proskuryakov. LayoutTests/imported/w3c: * web-platform-tests/url/a-element-expected.txt: * web-platform-tests/url/a-element-xhtml-expected.txt: * web-platform-tests/url/url-constructor-expected.txt: Source/WebCore: Covered by updated API tests, which show that URLParser now matches URL::parse in these cases. Also covered by newly failing web platform tests, which were failing before URLParser was enabled. If whatwg/url#87 is resolved we can change behavior to match. * platform/URLParser.cpp: (WebCore::URLParser::isSingleDotPathSegment): (WebCore::URLParser::isDoubleDotPathSegment): (WebCore::URLParser::consumeSingleDotPathSegment): (WebCore::URLParser::consumeDoubleDotPathSegment): (WebCore::URLParser::parse): (WebCore::URLParser::isPercentEncodedDot): Deleted. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): LayoutTests: * fast/url/path-expected.txt: * fast/url/standard-url-expected.txt: git-svn-id: http://svn.webkit.org/repository/webkit/trunk@207805 268f45cc-cd09-0410-ab3c-d52691b4dbfc

annevk · 2016-10-25T07:15:21Z

I don't understand. When I try new URL("https://example/%2e/test") I get https://example/test in Firefox (Nightly).

annevk · 2016-10-25T07:17:16Z

And not normalizing that seems dangerous for servers, since you might be able to trick the server to go outside the root of the site.

valenting · 2016-10-25T09:45:54Z

The bug was about %2e%2e%2e - or %2eX we don't normalize that.

achristensen07 · 2016-10-25T18:23:42Z

What if we kept the definition of double-dot path segments to protect servers and maybe single-dot path segments, but removed this:
If c is "%" and remaining, ASCII lowercased starts with "2e", append "." to buffer and increase pointer by two.

annevk · 2016-10-26T06:25:26Z

Yeah, that could work. I'd still really like to hear from @sleevi / @foolip.

foolip · 2016-10-26T11:37:55Z

I don't know Blink's KURL or what lies beneath well enough to judge what that change might mean. @sleevi?

Tests that exercise the difference might be useful, if they'd go from pass to fail or vice versa in Chrome it should be easier to understand where to look. https://github.com/w3c/web-platform-tests/tree/master/url is the right place I guess?

annevk · 2016-10-26T12:33:18Z

@foolip yes.

gibson042 · 2016-10-26T18:57:48Z

Safari and Firefox both leave "http://host/%%36%31" untouched when canonicalizing.

And therefore equivalent to "http://host/%25%2536%2531", right?

achristensen07 · 2016-10-27T02:45:02Z

No, none of those percents are encoded or decoded.

Also, I don't think it is adequate to protect servers by relying on the client getting rid of %2e%2e in the path. A server should have to protect itself from requests for ../../../etc. even from malicious clients that do not obey the URL spec. I don't see how parsing %2e%2e as .. protects servers.

annevk · 2016-10-27T06:21:01Z

@achristensen07 you cannot craft malicious requests like that from a browser (and therefore with the user's credentials).

sleevi · 2016-10-27T18:59:08Z

@annevk Sorry for the delay. The rationale for Chrome's current behaviour is, I believe, captured here in this comment block - https://cs.chromium.org/chromium/src/url/url_canon_path.cc?rcl=0&l=167

The reason for that change is explained in https://chromium.googlesource.com/chromium/src/+/6982036dc15ba85cdbf7339265f030ff7582ff3b - and the related crash.

So there are definitely implications to changing Chrome that make it harder to be an easy flip, but it's not to say it's unreasonable to explore changing.

gibson042 · 2016-10-28T05:34:36Z

No, none of those percents are encoded or decoded.

This project claims an explicit goal of replacing RFC 3986 and RFC 3987 by aligning with contemporary implementations, presumably affecting both clients and servers as those documents do. And since percent-decoding is already specified, "http://host/%36%31" is equivalent to "http://host/61". The question is what equivalence class (if any) "http://host/%%36%31" falls into (e.g., Chrome treats it as equivalent to "http://host/%2561", one option of at least four).

annevk · 2016-10-28T07:18:39Z

@gibson042 percent-decoding is not specified by the URL Standard (at least not to a significant extent). Replacing those RFCs does not mean that we'll prescribe what they prescribe.

annevk · 2016-10-28T09:56:13Z

It seems we avoid most issues here by simply not percent-decoding (other than host, where a one-time decode pass followed by IDNA is fine as it will return failure if anything fails).

My conclusion here is to embrace the suggestion in #87 (comment) to only work with double-dot path segments and single-dot path segments and not decode %2e generally (or anything else).

We need to make sure that is covered by the testsuite.

Gecko will need to WONTFIX https://bugzilla.mozilla.org/show_bug.cgi?id=1197123 (action item for @valenting I think).

Chrome and Edge will have to change their behavior. I think in the end that will turn out to be a drastic code simplification for them since they no longer have to worry about cases such as %%300.

annevk · 2016-10-31T16:36:46Z

@sleevi I'm not sure I follow. If there's no normalization during parsing, there's no normalization in the model, and there's no normalization exposed in the API. Where do you think normalization would happen?

sleevi · 2016-10-31T16:56:54Z

@annevk I'm suggesting that normalization is an intrinsic necessity of Chrome, at present, and as such, ends up reflected in the API we surface. While it's possible to suggest that's changable, it's not currently a high priority - that is, the normalization that occurs is an intrinsic dependency of many of Chromium's URL-handling subsystems. If we change the parsing phases to remove normalization, this creates more issues than it solves. Put differently, if normalization is removed, it's unlikely Chromium will be able to align on this point in any reasonable timeframe.

annevk · 2016-10-31T19:10:53Z

@sleevi could you perhaps explain why it's an intrinsic necessity in some detail? And I guess also, what exactly the normalization entails?

sleevi · 2016-10-31T19:15:50Z

@annevk It's important to understand, as mentioned before, that our URL code is shared across a variety of Google client platforms, and, even within Chrome, shared through a variety of layers - IPC, networking, UI (and, of that, security UI as a distinct subset), and of course, the web-visible exposure provided by the URL API.

While the URL API itself has no intrinsic needs that we're immediately aware of (short of checking compat issues), what is essential for a variety of checks and comparisons is that we have a consistent set of behaviours and presentations for URLs, and that inherently necessitates normalization, such that any two URLs, if they represent the same underlying URL, are treated equivalently. This behaviour has been extensively integrated into our code, because it was one of the motivations for the library itself.

Again, that's not to say we can't change, but it's certainly not a priority, given how significant an undertaking it would be. I think @gibson042's concerns are reasonable, and I suspect that many web developers would be surprised. I have trouble accepting @achristensen07's arguments against it.

To put the counter-argument forward: Why do you feel that normalization would be bad?

annevk · 2016-11-01T09:01:47Z

I'm not necessarily opposed, if we can find a reasonable processing model that everyone can agree on and doesn't lead to subtle issues with input such as %%300. I'm just trying to explore options.

And note that the URL API is not the only bit that's web-visible. HTTP requests, <a>, location, etc. are all part of the equation.

annevk · 2016-12-29T14:16:11Z

I think for now I'm proceeding with the plan on aligning the standard with what Firefox and Safari implement.

We don't have a credible alternative proposal and the current document is aligned with nobody.

And while RFC3986 does allow for additional normalization, it doesn't prescribe it, so I think either way we're roughly aligned with that. We could mention explicitly that servers are allowed to perform additional normalization and maybe should do that at some point. Or even expose an API to clients.

This reverts commit 47d2089. See whatwg/url#87 for discussion.

This basically reverts bee5ad8 as it did not end up shipping in Firefox and the tentative conclusion is that percent-decoding in URLs while parsing creates more trouble than it’s worth. We’ll still handle %2e when it’s part of a single-dot or double-dot path segment. Fixes #87.

This basically reverts bee5ad8 as it did not end up shipping in Firefox and the tentative conclusion is that percent-decoding in URLs while parsing creates more trouble than it’s worth. We’ll still handle %2e when it’s part of a single-dot or double-dot path segment. Tests: web-platform-tests/wpt#4104. Closes #87.

This reverts commit 47d2089 and also updates setter tests that were added later. See whatwg/url#87 for discussion and whatwg/url#156 for the URL Standard PR.

This basically reverts bee5ad8 as it did not end up shipping in Firefox and the tentative conclusion is that percent-decoding in URLs while parsing creates more trouble than it’s worth. We’ll still handle %2e when it’s part of a single-dot or double-dot path segment. Tests: web-platform-tests/wpt#4104. Closes #87.

Per a recent change to the URL spec, arbitrary %2e sequences in URL paths that are not single or double dot segments are not to be decoded. Refs: whatwg/url#87 Refs: whatwg/url#156 Refs: web-platform-tests/wpt@d93247d Fixes: nodejs#10598

Per a recent change to the URL spec, arbitrary %2e sequences in URL paths that are not single or double dot segments are not to be decoded. Refs: whatwg/url#87 Refs: whatwg/url#156 Refs: web-platform-tests/wpt@d93247d Fixes: #10598 PR-URL: #10602 Reviewed-By: Michal Zasso <targos@protonmail.com> Reviewed-By: Italo A. Casas <me@italoacasas.com> Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

Per a recent change to the URL spec, arbitrary %2e sequences in URL paths that are not single or double dot segments are not to be decoded. Refs: whatwg/url#87 Refs: whatwg/url#156 Refs: web-platform-tests/wpt@d93247d Fixes: nodejs#10598 PR-URL: nodejs#10602 Reviewed-By: Michal Zasso <targos@protonmail.com> Reviewed-By: Italo A. Casas <me@italoacasas.com> Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

https://bugs.webkit.org/show_bug.cgi?id=163929 Reviewed by Alexey Proskuryakov. LayoutTests/imported/w3c: * web-platform-tests/url/a-element-expected.txt: * web-platform-tests/url/a-element-xhtml-expected.txt: * web-platform-tests/url/url-constructor-expected.txt: Source/WebCore: Covered by updated API tests, which show that URLParser now matches URL::parse in these cases. Also covered by newly failing web platform tests, which were failing before URLParser was enabled. If whatwg/url#87 is resolved we can change behavior to match. * platform/URLParser.cpp: (WebCore::URLParser::isSingleDotPathSegment): (WebCore::URLParser::isDoubleDotPathSegment): (WebCore::URLParser::consumeSingleDotPathSegment): (WebCore::URLParser::consumeDoubleDotPathSegment): (WebCore::URLParser::parse): (WebCore::URLParser::isPercentEncodedDot): Deleted. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): LayoutTests: * fast/url/path-expected.txt: * fast/url/standard-url-expected.txt: * fetch/fetch-url-serialization-expected.txt: Canonical link: https://commits.webkit.org/181656@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207795 268f45cc-cd09-0410-ab3c-d52691b4dbfc

https://bugs.webkit.org/show_bug.cgi?id=163929 Reviewed by Alexey Proskuryakov. LayoutTests/imported/w3c: * web-platform-tests/url/a-element-expected.txt: * web-platform-tests/url/a-element-xhtml-expected.txt: * web-platform-tests/url/url-constructor-expected.txt: Source/WebCore: Covered by updated API tests, which show that URLParser now matches URL::parse in these cases. Also covered by newly failing web platform tests, which were failing before URLParser was enabled. If whatwg/url#87 is resolved we can change behavior to match. * platform/URLParser.cpp: (WebCore::URLParser::isSingleDotPathSegment): (WebCore::URLParser::isDoubleDotPathSegment): (WebCore::URLParser::consumeSingleDotPathSegment): (WebCore::URLParser::consumeDoubleDotPathSegment): (WebCore::URLParser::parse): (WebCore::URLParser::isPercentEncodedDot): Deleted. Tools: * TestWebKitAPI/Tests/WebCore/URLParser.cpp: (TestWebKitAPI::TEST_F): LayoutTests: * fast/url/path-expected.txt: * fast/url/standard-url-expected.txt: Canonical link: https://commits.webkit.org/181662@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@207805 268f45cc-cd09-0410-ab3c-d52691b4dbfc

SimonSapin mentioned this issue Feb 10, 2016

All instances of %2e in the path should be decoded into . #77

Closed

sleevi mentioned this issue Nov 4, 2016

What to do with /reference-implementation/ #162

Closed

annevk mentioned this issue Dec 8, 2016

Encode percent(%)s that are not part of a percent-encoded byte when parsing a URL #170

Closed

annevk added topic: parser and removed topic: parser labels Dec 20, 2016

annevk added a commit to web-platform-tests/wpt that referenced this issue Dec 29, 2016

Don't decode arbitrary %2e during URL parsing after all

a45fca2

This reverts commit 47d2089. See whatwg/url#87 for discussion.

annevk closed this as completed in #156 Jan 3, 2017

jasnell mentioned this issue Jan 4, 2017

url: do not decode arbitrary %2e sequences in paths nodejs/node#10602

Closed

3 tasks

domenic mentioned this issue Jan 16, 2017

Should 'C%3A' or 'C%7C' be interpreted as 'C:' and 'C|'? #210

Closed

achristensen07 mentioned this issue Mar 24, 2017

Single and double dot segments in paths cannot be represented – Premature decoding of %2e character sequences. #281

Closed

gibson042 mentioned this issue Oct 20, 2017

It's not immediately clear that "URL syntax" and "URL parser" conflict #118

Closed

rmisev mentioned this issue Jan 19, 2018

Need an "unreserved" character set (and better define how to percent-encode arbitrary strings) #369

Closed

alwinb mentioned this issue Jul 7, 2021

Should we unescape characters in path? #606

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Percent-decode more stuff? #87

Percent-decode more stuff? #87

SimonSapin commented Feb 10, 2016

annevk commented Feb 10, 2016

annevk commented Oct 19, 2016

achristensen07 commented Oct 20, 2016

annevk commented Oct 20, 2016

annevk commented Oct 20, 2016

achristensen07 commented Oct 20, 2016

annevk commented Oct 21, 2016

annevk commented Oct 21, 2016

achristensen07 commented Oct 21, 2016

gibson042 commented Oct 22, 2016

achristensen07 commented Oct 25, 2016

valenting commented Oct 25, 2016 •

edited

Loading

achristensen07 commented Oct 25, 2016

annevk commented Oct 25, 2016 •

edited

Loading

annevk commented Oct 25, 2016

valenting commented Oct 25, 2016

achristensen07 commented Oct 25, 2016

annevk commented Oct 26, 2016

foolip commented Oct 26, 2016

annevk commented Oct 26, 2016

gibson042 commented Oct 26, 2016

achristensen07 commented Oct 27, 2016

annevk commented Oct 27, 2016

sleevi commented Oct 27, 2016

gibson042 commented Oct 28, 2016

annevk commented Oct 28, 2016 •

edited

Loading

annevk commented Oct 28, 2016

annevk commented Oct 31, 2016

sleevi commented Oct 31, 2016

annevk commented Oct 31, 2016

sleevi commented Oct 31, 2016

annevk commented Nov 1, 2016

annevk commented Dec 29, 2016

Percent-decode more stuff? #87

Percent-decode more stuff? #87

Comments

SimonSapin commented Feb 10, 2016

annevk commented Feb 10, 2016

annevk commented Oct 19, 2016

achristensen07 commented Oct 20, 2016

annevk commented Oct 20, 2016

annevk commented Oct 20, 2016

achristensen07 commented Oct 20, 2016

annevk commented Oct 21, 2016

annevk commented Oct 21, 2016

achristensen07 commented Oct 21, 2016

gibson042 commented Oct 22, 2016

achristensen07 commented Oct 25, 2016

valenting commented Oct 25, 2016 • edited Loading

achristensen07 commented Oct 25, 2016

annevk commented Oct 25, 2016 • edited Loading

annevk commented Oct 25, 2016

valenting commented Oct 25, 2016

achristensen07 commented Oct 25, 2016

annevk commented Oct 26, 2016

foolip commented Oct 26, 2016

annevk commented Oct 26, 2016

gibson042 commented Oct 26, 2016

achristensen07 commented Oct 27, 2016

annevk commented Oct 27, 2016

sleevi commented Oct 27, 2016

gibson042 commented Oct 28, 2016

annevk commented Oct 28, 2016 • edited Loading

annevk commented Oct 28, 2016

annevk commented Oct 31, 2016

sleevi commented Oct 31, 2016

annevk commented Oct 31, 2016

sleevi commented Oct 31, 2016

annevk commented Nov 1, 2016

annevk commented Dec 29, 2016

valenting commented Oct 25, 2016 •

edited

Loading

annevk commented Oct 25, 2016 •

edited

Loading

annevk commented Oct 28, 2016 •

edited

Loading