Add XML declaration encoding sniffing
Closes #1438, where we found out that this is required for web
compatibility. The algorithm given here is an exact copy of that used by
WebKit and Blink, with the exception that it does not detect UTF-32 byte
sequences since in web-standards-world, UTF-32 must not be supported.
domenic committed Sep 7, 2016
1 parent 85227d2 commit daf61a2
Showing 1 changed file with 108 additions and 7 deletions.
115 changes: 108 additions & 7 deletions source
@@ -99996,19 +99996,71 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
encoding</dfn>, given some defined <var>end condition</var>, then it must run the
following steps. These steps either abort unsuccessfully or return a character encoding. If at any
point during these steps (including during instances of the <span
data-x="concept-get-attributes-when-sniffing">get an attribute</span> algorithm invoked by this
one) the user agent either runs out of bytes (meaning the <var>position</var> pointer
created in the first step below goes beyond the end of the byte stream obtained so far) or reaches
its <var>end condition</var>, then abort the <span>prescan a byte stream to determine its
encoding</span> algorithm unsuccessfully.</p>
data-x="concept-get-attributes-when-sniffing">get an attribute</span> algorithm invoked by
this one) the user agent either runs out of bytes (meaning the <var>position</var> pointer
created in the second step below goes beyond the end of the byte stream obtained so far) or
reaches its <var>end condition</var>, then if the below <var>fallback encoding</var> variable is
set to a non-null value, abort the <span>prescan a byte stream to determine its encoding</span>
algorithm with <var>fallback encoding</var> as the encoding; otherwise, abort the algorithm
unsuccessfully.</p>

<ol>

<li><p>Let <var>fallback encoding</var> be null.</p></li>

<li><p>Let <var>position</var> be a pointer to a byte in the input byte stream, initially
pointing at the first byte.</p></li>

<li>
<p>Prescan for XML declarations: If <var>position</var> points to:</p>

<dl class="switch">
<dt>A sequence of bytes starting with: 0x3C, 0x3F, 0x78, 0x6D, 0x6C (case-sensitive ASCII
'&lt;?xml')</dt>
<dd>
<p><span data-x="concept-get-xml-encoding-when-sniffing">Get an XML encoding</span>. If this
does not return failure, set <var>fallback encoding</var> to the returned encoding, and then
continue with this algorithm.</p>
</dd>

<dt>A sequence of bytes starting with: 0x3C, 0x0, 0x3F, 0x0, 0x78, 0x0 (case-sensitive UTF-16
little-endian '&lt;?x')</dt>
<dd>

<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
returning <span>UTF-16LE</span>.</p>

</dd>

<dt>A sequence of bytes starting with: 0x0, 0x3C, 0x0, 0x3F, 0x0, 0x78 (case-sensitive UTF-16
big-endian '&lt;?x')</dt>
<dd>
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
returning <span>UTF-16BE</span>.</p>
</dd>

<!-- the Encoding Standard doesn't support UTF-32:
https://github.com/whatwg/html/issues/1438#issuecomment-245142577

<dt>A sequence of bytes starting with: 0x3C, 0x0, 0x0, 0x0, 0x3F, 0x0, 0x0, 0x0 (case-sensitive
UTF-32 little-endian '&lt;?')</dt>
<dd>
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
returning <span>UTF-32LE</span>.</p>
</dd>

<dt>A sequence of bytes starting with: 0x0, 0x0, 0x0, 0x3C, 0x0, 0x0, 0x0, 0x3F (case-sensitive
UTF-32 big-endian '&lt;?')</dt>
<dd>
<p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
returning <span>UTF-32BE</span>.</p>
</dd>
-->
</dl>

<p class="note">Prescanning for XML declarations, even in HTML documents, must be done for
compatibility with legacy content. See <a
href="https://github.com/whatwg/html/issues/1438">issue #1438</a>.</p>
</li>

<li>
@@ -100299,6 +100351,55 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

</ol>
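
The XML declaration prescan step added above, together with the new fallback rule, can be sketched roughly as follows. This is a minimal illustration rather than the spec's or any browser's implementation: `prescan_xml_declaration`, `prescan`, and the two callables passed in (`get_xml_encoding` for the "get an XML encoding" steps, `scan_rest` for the remaining prescan steps) are hypothetical names, and `scan_rest` is assumed to return `None` when it runs out of bytes or reaches its end condition.

```python
from typing import Callable, Optional, Tuple

# Byte prefixes from the switch in the prescan step.
ASCII_XML = bytes([0x3C, 0x3F, 0x78, 0x6D, 0x6C])           # '<?xml'
UTF16LE_XML = bytes([0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00])   # '<?x' in UTF-16LE
UTF16BE_XML = bytes([0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78])   # '<?x' in UTF-16BE

Sniffer = Callable[[bytes], Optional[str]]


def prescan_xml_declaration(
    data: bytes, get_xml_encoding: Sniffer
) -> Tuple[Optional[str], Optional[str]]:
    """Return (result, fallback_encoding).

    A non-None result means the prescan aborts immediately with that
    encoding. Otherwise the prescan continues, and fallback_encoding is
    what it returns if it later runs out of bytes.
    """
    if data.startswith(ASCII_XML):
        # A failure (None) simply leaves the fallback unset.
        return None, get_xml_encoding(data)
    if data.startswith(UTF16LE_XML):
        return "UTF-16LE", None
    if data.startswith(UTF16BE_XML):
        return "UTF-16BE", None
    return None, None


def prescan(data: bytes, get_xml_encoding: Sniffer, scan_rest: Sniffer) -> Optional[str]:
    """Run the XML declaration step, then the (hypothetical) remaining steps."""
    result, fallback = prescan_xml_declaration(data, get_xml_encoding)
    if result is not None:
        return result
    found = scan_rest(data)  # assumed to yield None when bytes run out
    return found if found is not None else fallback
```

In this sketch an encoding found by the later steps takes priority over the one from the XML declaration; the declaration only matters when those steps give up.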

<p>When the <span>prescan a byte stream to determine its encoding</span> algorithm says to <dfn
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</dfn>, it means doing
this. If at any point during these steps the <var>encodingPosition</var> pointer created in the
first step below goes beyond the end of the byte stream obtained so far, abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm and return
failure.</p>

<ol>
<li><p>Let <var>encodingPosition</var> be a distinct pointer to the same place in the input byte
stream as <var>position</var>.</p></li>

<li><p>Let <var>xmlDeclarationEnd</var> be a pointer to the next byte in the input byte
stream which is 0x3E (ASCII '>'). If there is no such byte, abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
and return failure.</p></li>

<li><p>Set <var>encodingPosition</var> to the position of the first occurrence of the subsequence
of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the
current <var>encodingPosition</var>. If there is no such sequence, abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
and return failure.</p></li>

<li><p>Advance <var>encodingPosition</var> past the 0x67 (ASCII 'g') byte.</p></li>

<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is
either an ASCII space or control character), advance <var>encodingPosition</var> to the next
byte.</p></li>

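<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (ASCII '='), abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
and return failure.</p></li>

<li><p>Advance <var>encodingPosition</var> to the next byte.</p></li>

<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e. it is
either an ASCII space or control character), advance <var>encodingPosition</var> to the next
byte.</p></li>

<li><p>Let <var>quoteMark</var> be the byte at <var>encodingPosition</var>.</p></li>

<li><p>If <var>quoteMark</var> is neither 0x22 (ASCII '"') nor 0x27 (ASCII "'"), abort the <span
data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> algorithm
and return failure.</p></li>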

<li><p>Advance <var>encodingPosition</var> to the next byte.</p></li>

<li><p>Let <var>encodingEndPosition</var> be the position of the next occurrence of
<var>quoteMark</var> at or after <var>encodingPosition</var>. If <var>quoteMark</var> does not
occur again, abort the <span data-x="concept-get-xml-encoding-when-sniffing">get an XML
encoding</span> algorithm and return failure.</p></li>

<li><p>Let <var>potentialEncoding</var> be the Unicode string whose code points are the same as
the values of the bytes between <var>encodingPosition</var> (inclusive) and
<var>encodingEndPosition</var> (exclusive).</p></li>

<li><p>Return the result of <span>getting an encoding</span> given
<var>potentialEncoding</var>.</p></li>
</ol>
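
The "get an XML encoding" steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the WebKit or Blink code: `get_xml_encoding`, `get_an_encoding`, and the small `_LABELS` table are hypothetical stand-ins (the real label table in the Encoding Standard is far larger). The sketch also assumes that the pointer moves past the '=' byte and any following whitespace before the quote mark is read, and that only `"` and `'` count as quote marks.

```python
from typing import Optional

# Hypothetical, heavily truncated label table standing in for the
# Encoding Standard's "getting an encoding" lookup.
_LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "shift_jis": "Shift_JIS",
}


def get_an_encoding(label: str) -> Optional[str]:
    """Simplified stand-in: trim ASCII whitespace, lowercase, look up."""
    return _LABELS.get(label.strip("\t\n\x0c\r ").lower())


def _skip_whitespace(data: bytes, pos: int) -> int:
    """Advance past bytes <= 0x20 (ASCII space and control characters)."""
    while pos < len(data) and data[pos] <= 0x20:
        pos += 1
    return pos


def get_xml_encoding(data: bytes, position: int = 0) -> Optional[str]:
    """Return an encoding name, or None where the steps return failure."""
    # The declaration must be closed by '>' within the bytes seen so far.
    if data.find(b">", position) == -1:
        return None
    pos = data.find(b"encoding", position)
    if pos == -1:
        return None
    pos += len(b"encoding")           # advance past the final 'g'
    pos = _skip_whitespace(data, pos)
    if pos >= len(data) or data[pos] != 0x3D:   # '='
        return None
    pos = _skip_whitespace(data, pos + 1)       # assumption: skip '=' and whitespace
    if pos >= len(data) or data[pos] not in (0x22, 0x27):  # assumption: '"' or "'"
        return None
    quote_mark = data[pos]
    pos += 1
    end = data.find(bytes([quote_mark]), pos)
    if end == -1:
        return None
    # Map each byte value to the code point with the same value.
    potential_encoding = data[pos:end].decode("latin-1")
    return get_an_encoding(potential_encoding)
```

For example, `get_xml_encoding(b'<?xml version="1.0" encoding="windows-1252"?>')` yields `"windows-1252"` under this table, while a declaration without an `encoding` pseudo-attribute yields `None`. Running past the end of the input is treated as failure throughout, matching the rule stated before the steps.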

<p>For the sake of interoperability, user agents should not use a pre-scan algorithm that returns
different results than the one described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone...)</p>