-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add XML declaration encoding sniffing #1752
Conversation
See #1438 (comment) for my take on the WebKit/Blink behavior around parsing the XML declaration... Note that Edge does not do that, which is why I kept trying to get some information on what it does do. |
source
Outdated
of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the | ||
current <var>encodingPosition</var>. If there is no such sequence, abort the <span | ||
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding algorithm</span> algorithm | ||
and return failure.</p></li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could return failure here if the byte before the 'e' is not less than or equal to 0x20, to fix the fooencoding="utf-8" problem. (I haven't tested what Edge does though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we should tokenize attributes as is done for meta
scanner (consider <?xml foo=" encoding='utf-8'"?>
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept the WebKit-compatible behavior.
I'm going to tag this as "do not merge yet" until we either:
|
I think we should add WebKit/Blink-compatible XML declaration sniffing. |
(I disapprove of WebKit improvising on this back when it did, but now that IE and Spartan are going away, the behavior is part of the compatibility legacy.) |
I rebased this, addressed the obvious feedback, but two comments from Simon are still remaining. If someone has a pointer to where this is implemented in Chromium/WebKit I'd be happy to read through that and do a comparison. I also wonder if there's a better way of writing this down as the jumping around doesn't seem great. |
I'm OK with going with the implemented behavior. The parts that are bogus can be called out with notes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some clear bugs and one issue that matches Blink, but I think matching Blink is harmful due to the network buffer boundary dependency.
source
Outdated
@@ -102094,6 +102128,55 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> { | |||
step.</p></li> | |||
</ol> | |||
|
|||
<p>When the <span>prescan a byte stream to determine its encoding</span> algorithm says to <dfn | |||
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</dfn>, it means doing | |||
this. If at any point during these steps the <var>encodingPosition</var> pointer created in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to match Blink, but I think it's bad to make this dependent on network byte buffer boundaries. If I was writing this from scratch, I'd make a state machine that'd fail fast if it doesn't match encoding=
next, but let's try to stay closer to what Blink does. I'd prefer to require scanning exactly the first 1024 for >
. However, the spec presently allows for stopping earlier, which is bad.
I suggest saying that the browser must scan for >
for as far as it would scan for meta charset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I didn't fix this one, yet.
Hmm. In fact, since |
Hmm. Or at least |
It looks like the UTF-16 cases take precedence over Proof: So, the UTF-16 cases should be checked before If |
So cool to see this revived; thanks @hsivonen for the prod and the review, and thanks @annevk for rebasing. Since this was originally proposed the WHATWG has adopted a tests-required policy, so I hope we can also take the opportunity to add web platform tests for these cases. @hsivonen, it seems like you're working on this area in Gecko; will you be able to do that as part of your implementation? |
I can write tests, yes. |
Ping @hsivonen; did you end up writing tests for this? |
Manual demos, not in WPT form yet: https://hsivonen.com/test/moz/xml-decl/ |
I'm unhappy that it looks like WebKit managed to give UTF-16 The precedence appears to be:
It's sad that WebKit added a check for UTF-16 |
Looks like WebKit and Blink don't use get an encoding: https://hsivonen.com/test/moz/xml-decl/space-around-label.htm |
WebKit and Blink don't trim ASCII whitespace when resolving the encoding label from an XML declaration in text/html but do perform alias resolution: This makes sense in the light of the XML spec not allowing ASCII whitespace around the label: If we don't want to introduce get an encoding without trimming in the Encoding Standard, we can make the spec text here bail out if there's ASCII whitespace inside the quotes. |
I'm sorry, I'm hearing a lot of history, some "this is strange", and some of what seems like the blame game. But I still haven't heard a clear articulation of what the benefits would be, either to web users, or to implementations, or to interop... Could you phrase this in terms of, e.g., pages that are broken with WebKit and Blink's current implementation, but would be fixed by the spec change you're proposing here? It sounds like a small change, but if there's literally no benefits, then it's hard to argue for it. Perhaps another way of phrasing it is, what part of the priority of constituencies is this change targeting? |
The benefit is cross-engine consistency (i.e. interop) on how far to scan. Concretely, whether it's legitimate to land WPTs that test exactly how far to scan. At this point, it's not known that it fixes or breaks any non-test page in particular; the limit is substantially higher than the length of a plausible XML declaration. For other features, we typically specify exact behavior and write tests for the boundary conditions, even if there's no evidence of concrete Web compat concerns at hand. The expected benefit for testing an exact boundary condition here is low, but the same argument could be made about many other edge case tests that we write as a matter of testing culture that treats edge case differences in implementation and "implementation-defined" in spec text as presumptively suspect. Would you like me to remove the tests that seek to show that the browser does not scan past 1024 bytes (Blink and WebKit don't pass)? Would you like me to remove the tests that seek show that the browser scans all the way to byte 1024 (Blink and WebKit pass given the current buffering provided by the WPT server; I haven't investigated if they'd pass with different buffer boundaries or timing, but I haven't done that investigation for any of the tests)? |
Further, should this go into WPT? https://phabricator.services.mozilla.com/differential/changeset/?ref=3707434 Firefox with my patch parses the XML declaration as declaring It's unclear to me what exactly makes Chrome give up. Making an XML declaration megabytes long doesn't make it give up. Causing the stream to stall for 20 seconds in the middle of an XML declaration doesn't make it give up. Making a normal-sized XML declaration trickle in byte by byte makes it give up. So far, my hypothesis is that Chrome gives up after some number of network chunks that it considers contiguous, but I'd appreciate a real explanation. |
@mfreed7 is this something you could help with? From Mozilla's side we'd really like the bytes to document operation to be fully deterministic, i.e., not have it use implementation-defined. |
To elaborate on why I want to limit this to 1024 bytes:
Regarding Since real bogo-XML declarations are generally not longer than 45 bytes, at the point we have 1024 bytes available, it's unreasonable to add complexity to handle cases where there is excessive whitespace that, in principle, fits the syntax: we're already accommodating a 20-fold excess length. In CSS, we have precedent for drawing the line to 1024 bytes for such theoretical infinite whitespace opportunity. Under the Well-defined Behavior Design Principle, we should write the limit in the spec as in done in CSS even though in both cases it's so large relative to the real-world usage of the syntax that the exact limit is unlikely to have much Web compat impact. |
I added a bit of text to handle UTF-16BE/LE declared using the ASCII-based syntax. The WPTs landed, but it looks like I will need to split the test to avoid timeouts. |
This just got re-assigned to me. My personal next step would be to remove the Gecko-only 1024 limitation so we can match 2/3 engines, and then remove or flip the pass conditions on any web platform tests which assume that if possible. (Although I'm not sure if it is.) |
What's your opinion on the analogous 1024-byte limit in CSS? What's your opinion on the |
It looks like for Web compat, Gecko will have to extend reloadless Let's drop the 1024-byte limit for XML declaration. I'd still like to know how developers of the other engines view the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. So I guess the main thing remaining here is for @mfreed7 to look at the tests. Is there a follow-up issue for changing meta
element scanning accordingly?
the first step below goes beyond the end of the byte stream obtained so far) or reaches its | ||
<var>end condition</var>, then abort the <span>prescan a byte stream to determine its | ||
encoding</span> algorithm unsuccessfully.</p> | ||
encoding</dfn>, given some defined <dfn data-x="prescan-end-condition"><var>end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is now a named argument we should add:
export for="prescan a byte stream to determine its encoding"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this is exported in the first place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
<li><p>Advance <var>encodingPosition</var> past the 0x67 (g) byte.</p></li> | ||
|
||
<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: comma after i.e.
|
||
<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (=), then return failure.</p></li> | ||
|
||
<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
This LGTM. Test-wise, it looks like there are four test cases that need to change to re-remove the 1024-byte limit, right? I.e. the four that fail here? And then there's the trickle.py test. I do not understand this one. When I run it locally, and when it runs on the Chromium automation bots, it passes. Not even flaky. So I'm not exactly sure why it is failing on WPT.fyi, seemingly consistently. I also inspected the code for some time, and I don't see any reason why this should fail. Suggestions appreciated here! The failure is that it assumes windows-1252 for some reason. |
Ok, I might have found at least a clue. It is possible that this is due to a timeout of sorts in the test. Each of the Speculatively, we could split just the trickle.py test into a new xmldecl-3.html file, perhaps? |
I'm working on landing a split of the trickle.py test, to see if that helps: web-platform-tests/wpt#28874 |
So I landed the changes to trickle.py (splitting it out and marking it as a slow test) and that solved the problems on Chromium bots. However, it still fails on wpt.fyi: I don't understand the difference between WPT.fyi and our bots. Suggestions/thoughts appreciated. |
Is there any chance it's a content_shell vs. full chrome browser issue? Seems unlikely... |
Hmm, well that's interesting. Yes. I hadn't tried the full browser but it appears to fail there, at least locally. I'll give that a look, thanks! So WPT.fyi uses a full browser? I thought it was using content_shell like the bots. TIL. |
Actually no, maybe false alarm. Locally, it was just downloading the trickle.py file itself, rather than executing it... |
Yep, definitely false alarm. When actually running run_blink_wptserve.py, it works perfectly from a full browser also. Still, thanks for the suggestion. |
For me, http://wpt.live/html/syntax/xmldecl/xmldecl-3.html passes on Canary and fails on Stable. This is on Windows. Not really sure what that indicates... |
If I run |
I'm in the process of updating the WPT expectation around the kilobyte boundary. Thanks for getting this in the spec! |
Closes #1438, where we found out that this is required for web
compatibility. The algorithm given here is an exact copy of that used by
WebKit and Blink, with the exception that it does not detect UTF-32 byte
sequences since in web-standards-world, UTF-32 must not be supported.
@bzbarsky FYI. Would appreciate some careful editor review on this (@zcorpan seems especially good at byte-parsing algorithms) as while I tried to match these algorithms to the surrounding style, I'm not sure I succeeded. It would also be good to have someone else double-check the correspondence between this spec text and Blink/WebKit's algorithms; the fact that they use the ASCII XML declaration as a hint but then also check meta charset was a bit strange, but I think I captured that.
(PR template added by @hsivonen )
/parsing.html ( diff )