Change query state slightly to better deal with non-UTF-8 encodings #386

annevk · 2018-05-09T08:55:51Z

If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;.

The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation.
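To make the problem concrete, here is a minimal sketch using Python's built-in `xmlcharrefreplace` error handler. It behaves like the substitution described above, but it is only an analogy, not the Encoding Standard's algorithm. The snowman character, the choice of windows-1252, and the query string are illustrative assumptions.

```python
# U+2603 SNOWMAN has no representation in windows-1252.
snowman = "\u2603"

# Python's "xmlcharrefreplace" handler substitutes a decimal numeric
# character reference for each unencodable code point.
encoded = snowman.encode("windows-1252", errors="xmlcharrefreplace")
print(encoded)  # b'&#9731;'

# Dropped into a query unchanged, the reference contains "&" (a common
# parameter separator) and "#" (the fragment delimiter).
query = b"q=" + encoded + b"&lang=en"
print(query)  # b'q=&#9731;&lang=en'
```

The second output shows why encoding only `#` is not enough: the `&` introduced by the substitution cannot be told apart from the one that separates `q` from `lang`.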

Tests: [we need to make a number of test changes for this]

annevk · 2018-05-09T08:57:12Z

This ends up fixing whatwg/encoding#139. https://bugs.chromium.org/p/chromium/issues/detail?id=795733 can be closed.

cc @rakuco @hsivonen @valenting @domenic

annevk · 2018-05-09T08:59:20Z

cc @inexorabletash

annevk · 2018-05-09T09:06:30Z

url.bs


       <li>
-        <p>For each <var>byte</var> in <var>buffer</var>:
+        <p>If <var>bytes</var> starts with 0x26 (&amp;) 0x23 (#) and ends with 0x3B (;), then:


Maybe this should be "starts with `&#`" for consistency.

@rakuco

See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)

annevk · 2018-05-09T10:21:27Z

Bugs:

annevk · 2018-05-09T10:25:52Z

I'm somewhat hoping this doesn't affect Node.js (I think it should always use UTF-8), but please verify @jasnell.

TimothyGu · 2018-05-19T01:03:24Z

That's correct: in Node.js, URL parsing always uses UTF-8.

rmisev

I think the algorithm is correct (I didn't test it). However, all characters that must be encoded (`&#` and `;`) are at known positions, so there is no need to scan the bytes. I would change sub-steps 4 and 5 as follows:

If bytes starts with `&#` and ends with 0x3B (;), then:
1. Replace the beginning `&#` in bytes with `%26%23`.
2. Replace the ending 0x3B (;) in bytes with `%3B`.
3. Append isomorphic decoded bytes to url’s query.
Otherwise, for each byte in bytes:
...
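The first branch of these steps can be sketched as follows. This is a hedged illustration, not the spec text: `encode_query_chunk` and its `fallback` parameter are hypothetical names, and `fallback` stands in for the per-byte "otherwise" branch that is elided above. It assumes `data` is the result of encoding a single code point, and it uses Python's `latin-1` codec as the isomorphic decode, since that codec maps each byte to the code point of the same value.

```python
from typing import Callable


def encode_query_chunk(data: bytes, fallback: Callable[[bytes], str]) -> str:
    """Return the text to append to the URL's query for one encoded chunk.

    `fallback` is a hypothetical stand-in for the elided per-byte branch.
    """
    if data.startswith(b"&#") and data.endswith(b";"):
        # The separators sit at known positions, so no byte scan is needed:
        # replace the leading "&#" and the trailing ";" directly.
        inner = data[2:-1]
        return "%26%23" + inner.decode("latin-1") + "%3B"
    return fallback(data)


def passthrough(data: bytes) -> str:
    # Placeholder only; the real branch percent-encodes selected bytes.
    return data.decode("latin-1")


print(encode_query_chunk(b"&#9731;", passthrough))  # %26%239731%3B
print(encode_query_chunk(b"a", passthrough))        # a
```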

annevk · 2018-05-22T09:30:30Z

Thanks @rmisev, adopted your suggestion with some slight tweaks to the wording.

annevk · 2018-05-22T09:37:43Z

Given that this now has had review, browser bugs are filed, and test changes have been reviewed as well, I plan on landing this tomorrow unless I hear concerns before that time.

It's blocking further test refactoring efforts somewhat so it'd be good to get this over with.

If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;. The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation. Tests: web-platform-tests/wpt#10915. Fixes whatwg/encoding#139.

hsivonen

What was merged looks correct to me. The formulation looks like implementations are going to want to transform it to be more efficient in the context of real-world encoding APIs. In the future, we might want to add an informative note advising how to do that correctly.
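One possible shape for such a transformation, sketched with Python's `codecs.register_error`, is to have the encoder's error callback emit the already percent-encoded reference, so the whole string is encoded in one pass. This is an assumption about how an implementation might do it, not what any browser does; the handler name `url-query-ncr` and the function `encode_query` are hypothetical. The usual percent-encoding of the remaining bytes is not shown.

```python
import codecs


def _percent_encoded_ncr(err):
    # Only encoding errors are handled; anything else is re-raised.
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # Emit "%26%23<decimal>%3B" for each unencodable code point in the run.
    refs = "".join(
        f"%26%23{ord(ch)}%3B" for ch in err.object[err.start:err.end]
    )
    return refs, err.end


codecs.register_error("url-query-ncr", _percent_encoded_ncr)


def encode_query(text: str, encoding: str) -> bytes:
    # References produced by the encoder arrive already escaped, while a
    # literal "&#...;" typed in the input passes through untouched.
    return text.encode(encoding, errors="url-query-ncr")


print(encode_query("a\u2603&#1;", "windows-1252"))  # b'a%26%239731%3B&#1;'
```

The literal `&#1;` at the end of the input is left alone, which matches the requirement that `&` and `;` are encoded only when they came about as a result of the encode operation.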

annevk · 2018-05-23T09:21:52Z

Yeah, though I wouldn't mind non-UTF-8-performance penalties.

@rakuco

…, a=testonly Automatic update from web-platform-tests URL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915

…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 --HG-- extra : moz-landing-system : lando

…, a=testonly Automatic update from web-platform-tests URL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915 UltraBlame original commit: 13f3705568922e770ec97af2aad3e09e0449caa6

…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 UltraBlame original commit: 948a4673220c961438955f1c1346ee68e3dd8ff4

Editorial: avoid setting encoding multiple times

fb623cf

annevk added the "topic: parser" and "needs tests" labels May 9, 2018

annevk mentioned this pull request May 9, 2018

"html" error mode somewhat incompatible with URLs whatwg/encoding#139

Closed

annevk requested review from TimothyGu and rmisev May 9, 2018 09:02

annevk mentioned this pull request May 9, 2018

URL/Encoding: change query state parsing web-platform-tests/wpt#10915

Merged

annevk mentioned this pull request May 9, 2018

Extract Location object tests from query-encoding/ web-platform-tests/wpt#10891

Merged

annevk requested a review from domenic May 17, 2018 07:30

annevk requested a review from hsivonen May 20, 2018 07:59

rmisev reviewed May 22, 2018

annevk force-pushed the annevk/query-state-revamp branch from 3b4992f to 71cdb7d Compare May 22, 2018 09:30

annevk removed the "needs tests" label May 22, 2018

rmisev approved these changes May 22, 2018

annevk force-pushed the annevk/query-state-revamp branch from 71cdb7d to 2518aa4 Compare May 23, 2018 06:41

annevk merged commit f0e4390 into master May 23, 2018

annevk deleted the annevk/query-state-revamp branch May 23, 2018 06:45

hsivonen reviewed May 23, 2018

annevk mentioned this pull request May 14, 2020

Refactor parse query and percent-encode sets jsdom/whatwg-url#152

Closed
