-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change query state slightly to better deal with non-UTF-8 encodings #386
Conversation
This ends up fixing whatwg/encoding#139. https://bugs.chromium.org/p/chromium/issues/detail?id=795733 can be closed. |
url.bs
Outdated
|
||
<li> | ||
<p>For each <var>byte</var> in <var>buffer</var>: | ||
<p>If <var>bytes</var> starts with 0x26 (&) 0x23 (#) and ends with 0x3B (;), then: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be "starts with `&#
`" for consistency.
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
I'm somewhat hoping this doesn't affect Node.js (I think it should always use UTF-8), but please verify @jasnell. |
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
In Node.js URL parsing always uses UTF-8, that's correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think algorithm is correct (I didn't test it). Although all characters which must be encoded (&#
and ;
) are at known positions, so there no need to scan bytes. I would change the 4 and 5 sub-steps as follows:
- If bytes starts with `
&#
` and ends with 0x3B (;), then:- Replace the beginning `
&#
` in bytes with `%26%23
`. - Replace the ending 0x3B (;) in bytes with `
%3B
`. - Append isomorphic decoded bytes to url’s query.
- Replace the beginning `
- Otherwise, for each byte in bytes:
...
3b4992f
to
71cdb7d
Compare
Thanks @rmisev, adopted your suggestion with some slight tweaks to the wording. |
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
Given that this now has had review, browser bugs are filed, and test changes have been reviewed as well, I plan on landing this tomorrow unless I hear concerns before that time. It's blocking further test refactoring efforts somewhat so it'd be good to get this over with. |
If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;. The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation. Tests: web-platform-tests/wpt#10915. Fixes whatwg/encoding#139.
71cdb7d
to
2518aa4
Compare
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was merged looks correct to me. The formulation looks like implementations are going to want to transform it to be more efficient in the context of real-world encoding APIs. In the future, we might want to add an informative note advising how to do that correctly.
Yeah, though I wouldn't mind non-UTF-8-performance penalties. |
…, a=testonly Automatic update from web-platform-testsURL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to @rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915
…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 --HG-- extra : moz-landing-system : lando
…, a=testonly Automatic update from web-platform-testsURL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915 UltraBlame original commit: 13f3705568922e770ec97af2aad3e09e0449caa6
…, a=testonly Automatic update from web-platform-testsURL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915 UltraBlame original commit: 13f3705568922e770ec97af2aad3e09e0449caa6
…, a=testonly Automatic update from web-platform-testsURL/Encoding: change query state parsing See whatwg/encoding#139 for rationale and whatwg/url#386 for the change to the URL Standard. (I found all these resources in part due to rakuco's work on trying to align Chrome with the earlier iteration of the specification.) -- wpt-commits: e399a2c694345240639c23cc5e9e4f077a47cf30 wpt-pr: 10915 UltraBlame original commit: 13f3705568922e770ec97af2aad3e09e0449caa6
…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 UltraBlame original commit: 948a4673220c961438955f1c1346ee68e3dd8ff4
…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 UltraBlame original commit: 948a4673220c961438955f1c1346ee68e3dd8ff4
…pable code points in URL query state. r=valentin Spec change: whatwg/url#386 MozReview-Commit-ID: Fa84kCNghtU Differential Revision: https://phabricator.services.mozilla.com/D8728 UltraBlame original commit: 948a4673220c961438955f1c1346ee68e3dd8ff4
If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;.
The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation.
Tests: [we need to make a number of test changes for this]
Preview | Diff