Change query state slightly to better deal with non-UTF-8 encodings #386

Merged
annevk merged 2 commits into master from annevk/query-state-revamp
May 23, 2018

Member

@annevk annevk commented May 9, 2018

If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;.

The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation.
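To make the problem concrete, here is a minimal Python sketch. It assumes windows-1252 as the non-UTF-8 encoding and uses Python's `xmlcharrefreplace` error handler as a stand-in for the encoder behavior described above; the helper name `encode_code_point` is hypothetical and not part of the URL Standard.

```python
def encode_code_point(cp: str, encoding: str = "windows-1252") -> bytes:
    """Encode one code point; unencodable ones become b"&#<decimal>;"."""
    return cp.encode(encoding, errors="xmlcharrefreplace")


# U+2603 (snowman) has no mapping in windows-1252, so it becomes a reference.
snowman = encode_code_point("\u2603")

# Previous behavior as described above: only "#" is percent-encoded,
# so "&" and ";" survive and can still act as separators.
old_style = snowman.replace(b"#", b"%23")

# Behavior after this change: "&", "#", and ";" are all percent-encoded,
# but only because they came from the encode operation.
new_style = b"%26%23" + snowman[2:-1] + b"%3B"

# A naive split on "&" shows how the old output breaks a query apart.
old_parts = (b"x=" + old_style).split(b"&")
new_parts = (b"x=" + new_style).split(b"&")
```

With the old output, splitting on `&` yields two pieces for what was a single value; with the new output the value stays in one piece.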

Tests: [we need to make a number of test changes for this]


@annevk annevk added the "topic: parser" and "needs tests" labels May 9, 2018
@annevk
Member Author

annevk commented May 9, 2018

cc @inexorabletash

url.bs Outdated

<li>
<p>For each <var>byte</var> in <var>buffer</var>:
<p>If <var>bytes</var> starts with 0x26 (&amp;) 0x23 (#) and ends with 0x3B (;), then:
Member Author

Maybe this should be "starts with `&#`" for consistency.

annevk added a commit to web-platform-tests/wpt that referenced this pull request May 9, 2018
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard.

(I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
@annevk
Member Author

annevk commented May 9, 2018

I'm somewhat hoping this doesn't affect Node.js (I think it should always use UTF-8), but please verify @jasnell.

@annevk annevk requested a review from domenic May 17, 2018 07:30
@TimothyGu
Member

That's correct: in Node.js, URL parsing always uses UTF-8.

@annevk annevk requested a review from hsivonen May 20, 2018 07:59
Member

@rmisev rmisev left a comment

I think the algorithm is correct (I didn't test it). However, all characters that must be encoded (`&#` and `;`) are at known positions, so there is no need to scan bytes. I would change sub-steps 4 and 5 as follows:

  1. If bytes starts with `&#` and ends with 0x3B (;), then:
    1. Replace the beginning `&#` in bytes with `%26%23`.
    2. Replace the ending 0x3B (;) in bytes with `%3B`.
    3. Append isomorphic decoded bytes to url’s query.
  2. Otherwise, for each byte in bytes:
    ...
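The suggested sub-steps can be sketched in Python as follows. This is only an illustration under stated assumptions: `append_encoded`, `fallback`, and `placeholder_fallback` are hypothetical names, and the "otherwise" branch is left to a caller-supplied function because the suggestion elides it. Isomorphic decoding is modeled with Python's `latin-1` codec, which maps each byte to the code point of the same value.

```python
from typing import Callable


def append_encoded(query: str, encoded: bytes,
                   fallback: Callable[[bytes], str]) -> str:
    """Append the bytes produced by encoding ONE code point to the query."""
    if encoded.startswith(b"&#") and encoded.endswith(b";"):
        # The characters to escape sit at known positions, so no scan is needed.
        rewritten = b"%26%23" + encoded[2:-1] + b"%3B"
        return query + rewritten.decode("latin-1")  # isomorphic decode
    # Otherwise branch: elided in the suggestion, so delegated to the caller.
    return query + fallback(encoded)


def placeholder_fallback(encoded: bytes) -> str:
    # Hypothetical stand-in for the elided per-byte step; it does no
    # percent-encoding and exists only to make the sketch runnable.
    return encoded.decode("latin-1")
```

Because `encoded` holds the output for a single code point, a literal `&#1;` typed into the input arrives as five separate one-byte values and never matches the first branch.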

@annevk annevk force-pushed the annevk/query-state-revamp branch from 3b4992f to 71cdb7d Compare May 22, 2018 09:30
@annevk
Member Author

annevk commented May 22, 2018

Thanks @rmisev, adopted your suggestion with some slight tweaks to the wording.

@annevk
Member Author

annevk commented May 22, 2018

Given that this now has had review, browser bugs are filed, and test changes have been reviewed as well, I plan on landing this tomorrow unless I hear concerns before that time.

It's blocking further test refactoring efforts somewhat so it'd be good to get this over with.

@annevk annevk removed the "needs tests" label May 22, 2018
If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;.

The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation.

Tests: web-platform-tests/wpt#10915.

Fixes whatwg/encoding#139.
@annevk annevk force-pushed the annevk/query-state-revamp branch from 71cdb7d to 2518aa4 Compare May 23, 2018 06:41
@annevk annevk merged commit f0e4390 into master May 23, 2018
@annevk annevk deleted the annevk/query-state-revamp branch May 23, 2018 06:45
Member

@hsivonen hsivonen left a comment

What was merged looks correct to me. As formulated, though, implementations will likely want to transform it to be more efficient in the context of real-world encoding APIs. In the future, we might want to add an informative note advising how to do that correctly.
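As one possible shape for such a transformation (a sketch that is not from this thread or the specification), an implementation whose encoding API supports custom error handling could emit the percent-encoded reference directly while encoding the whole string in one call. The example below uses Python's `codecs.register_error`; the handler name `url-query-reference`, the function `encode_query`, and the exact set of bytes that get percent-encoded are assumptions, and the approach is only considered here for a stateless single-byte encoding such as windows-1252.

```python
import codecs


def _percent_encoded_reference(err: UnicodeError):
    """Error handler: emit %26%23<decimal>%3B for each unencodable code point."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    unencodable = err.object[err.start:err.end]
    refs = "".join(f"%26%23{ord(ch)}%3B" for ch in unencodable)
    return refs, err.end  # resume encoding after the failed span


# Hypothetical handler name, registered once at import time.
codecs.register_error("url-query-reference", _percent_encoded_reference)


def encode_query(text: str, encoding: str = "windows-1252") -> str:
    raw = text.encode(encoding, errors="url-query-reference")
    out = []
    for b in raw:
        # Assumed percent-encode set for illustration; "%" itself is left
        # alone, so the references emitted by the handler pass through.
        if b < 0x21 or b > 0x7E or b in (0x22, 0x23, 0x3C, 0x3E):
            out.append(f"%{b:02X}")
        else:
            out.append(chr(b))
    return "".join(out)
```

For example, `encode_query("a=\u2603&b=\u00e9")` keeps the literal `&` between the two parameters while the unencodable snowman is written as `%26%239731%3B`. Stateful encodings would need separate care, which this sketch does not attempt.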

@annevk
Member Author

annevk commented May 23, 2018

Yeah, though I wouldn't mind performance penalties for non-UTF-8 encodings.

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Jun 5, 2018
…, a=testonly

Automatic update from web-platform-tests
URL/Encoding: change query state parsing

See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard.

(I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)

--

wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30
wpt-pr: 10915
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Oct 16, 2018
…pable code points in URL query state. r=valentin

Spec change: whatwg/url#386

MozReview-Commit-ID: Fa84kCNghtU

Differential Revision: https://phabricator.services.mozilla.com/D8728

--HG--
extra : moz-landing-system : lando