Add XML declaration encoding sniffing
Closes #1438, where we found out that this is required for web compatibility. The algorithm given here matches that used by WebKit and Blink, with the exception that it does not detect UTF-32 byte sequences since in web-standards-world, UTF-32 must not be supported.

Co-authored-by: Henri Sivonen <hsivonen@hsivonen.fi>
domenic and hsivonen authored May 6, 2021
1 parent 083c65c commit 800a2dc
Showing 1 changed file with 96 additions and 15 deletions.
111 changes: 96 additions & 15 deletions source
@@ -104597,13 +104597,14 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

<li>
<p>Optionally <span data-x="prescan a byte stream to determine its encoding">prescan the byte
-stream to determine its encoding</span>. The <var>end condition</var> is that the user
-agent decides that scanning further bytes would not be efficient. User agents are encouraged to
-only prescan the first 1024 bytes. User agents may decide that scanning <em>any</em> bytes is
-not efficient, in which case these substeps are entirely skipped.</p>
-
-<p>The aforementioned algorithm either aborts unsuccessfully or returns a character encoding. If
-it returns a character encoding, then return the same encoding, with <span
+stream to determine its encoding</span>, with the <i><span data-x="prescan-end-condition">end
+condition</span></i> being when the user agent decides that scanning further bytes would not be
+efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may
+decide that scanning <em>any</em> bytes is not efficient, in which case these substeps are
+entirely skipped.</p>
+
+<p>The aforementioned algorithm returns either a character encoding or failure. If it returns a
+character encoding, then return the same encoding, with <span
data-x="concept-encoding-confidence">confidence</span> <i>tentative</i>.</p>
</li>

@@ -105081,19 +105082,40 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<hr>

<p>When an algorithm requires a user agent to <dfn export>prescan a byte stream to determine its
-encoding</dfn>, given some defined <var>end condition</var>, then it must run the following steps.
-These steps either abort unsuccessfully or return a character encoding. If at any point during
-these steps (including during instances of the <span
+encoding</dfn>, given some defined <dfn data-x="prescan-end-condition" export for="prescan a byte
+stream to determine its encoding"><var>end condition</var></dfn>, then it must run the following
+steps. If at any point during these steps (including during instances of the <span
data-x="concept-get-attributes-when-sniffing">get an attribute</span> algorithm invoked by this
one) the user agent either runs out of bytes (meaning the <var>position</var> pointer created in
the first step below goes beyond the end of the byte stream obtained so far) or reaches its
<var>end condition</var>, then abort the <span>prescan a byte stream to determine its
-encoding</span> algorithm unsuccessfully.</p>
+encoding</span> algorithm and return the result of <span
+data-x="concept-get-xml-encoding-when-sniffing">get an XML encoding</span> applied to the same
+bytes that the <span>prescan a byte stream to determine its encoding</span> algorithm was applied
+to. Otherwise, these steps will return a character encoding.</p>

<ol>
+<li><p>Let <var>fallback encoding</var> be null.</p></li>
+
+<li><p>Let <var>position</var> be a pointer to a byte in the input byte stream, initially
+pointing at the first byte.</p></li>
+
<li>
-<p>Let <var>position</var> be a pointer to a byte in the input byte stream, initially
-pointing at the first byte.</p>
+<p>Prescan for UTF-16 XML declarations: If <var>position</var> points to:</p>
+
+<dl class="switch">
+<dt>A sequence of bytes starting with: 0x3C, 0x0, 0x3F, 0x0, 0x78, 0x0 (case-sensitive UTF-16
+little-endian '&lt;?x')</dt>
+<dd><p>Return <span>UTF-16LE</span>.</p></dd>
+
+<dt>A sequence of bytes starting with: 0x0, 0x3C, 0x0, 0x3F, 0x0, 0x78 (case-sensitive UTF-16
+big-endian '&lt;?x')</dt>
+<dd><p>Return <span>UTF-16BE</span>.</p></dd>
+</dl>
+
+<p class="note">For historical reasons, the prefix is two bytes longer than in <a
+href="https://www.w3.org/TR/REC-xml/#sec-guessing">Appendix F</a> of <cite>XML</cite> and the
+encoding name is not checked.</p>
</li>

<li>
@@ -105179,8 +105201,7 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<li><p>If <var>charset</var> is <span>x-user-defined</span>, then set <var>charset</var> to
<span>windows-1252</span>.</p></li>

-<li><p>Abort the <span>prescan a byte stream to determine its encoding</span> algorithm,
-returning the encoding given by <var>charset</var>.</p></li>
+<li><p>Return <var>charset</var>.</p></li>
</ol>
</dd>

@@ -105357,6 +105378,64 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
step.</p></li>
</ol>
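The UTF-16 XML declaration prescan step added earlier in this diff can be sketched in Python as a plain prefix comparison. This is only an illustration of the two byte patterns quoted in the diff, not code from WebKit, Blink, or any other engine. The function name `prescan_utf16_xml_declaration` and the use of `None` for "no match" are assumptions of this sketch.

```python
from typing import Optional

# '<?x' encoded as UTF-16 little-endian and big-endian, as listed in the diff.
UTF16LE_PREFIX = bytes([0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00])
UTF16BE_PREFIX = bytes([0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78])


def prescan_utf16_xml_declaration(data: bytes) -> Optional[str]:
    """Hypothetical helper: return "UTF-16LE" or "UTF-16BE" if the stream
    starts with the matching six-byte prefix, otherwise None."""
    if data.startswith(UTF16LE_PREFIX):
        return "UTF-16LE"
    if data.startswith(UTF16BE_PREFIX):
        return "UTF-16BE"
    # No UTF-16 XML declaration prefix; the caller continues the prescan.
    return None
```

As the note in the diff says, only the prefix is inspected; the encoding name inside the declaration is never read in this step.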

+<p>When the <span>prescan a byte stream to determine its encoding</span> algorithm is aborted
+without returning an encoding, <dfn data-x="concept-get-xml-encoding-when-sniffing">get an XML
+encoding</dfn> means doing this.</p>
+
+<p class="note">Looking for syntax resembling an XML declaration, even in <code>text/html</code>,
+is necessary for compatibility with existing content.</p>
+
+<ol>
+<li><p>Let <var>encodingPosition</var> be a pointer to the start of the stream.</p></li>
+
+<li><p>If <var>encodingPosition</var> does not point to the start of a byte sequence 0x3C, 0x3F,
+0x78, 0x6D, 0x6C (`<code data-x="">&lt;?xml</code>`), then return failure.</p></li>
+
+<li><p>Let <var>xmlDeclarationEnd</var> be a pointer to the next byte in the input byte stream
+which is 0x3E (>). If there is no such byte, then return failure.</p></li>
+
+<li><p>Set <var>encodingPosition</var> to the position of the first occurrence of the subsequence
+of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (`<code data-x="">encoding</code>`) at or
+after the current <var>encodingPosition</var>. If there is no such sequence, then return
+failure.</p></li>
+
+<li><p>Advance <var>encodingPosition</var> past the 0x67 (g) byte.</p></li>
+
+<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e., it is
+either an ASCII space or control character), advance <var>encodingPosition</var> to the next
+byte.</p></li>
+
+<li><p>If the byte at <var>encodingPosition</var> is not 0x3D (=), then return failure.</p></li>
+
+<li><p>While the byte at <var>encodingPosition</var> is less than or equal to 0x20 (i.e., it is
+either an ASCII space or control character), advance <var>encodingPosition</var> to the next
+byte.</p></li>
+
+<li><p>Let <var>quoteMark</var> be the byte at <var>encodingPosition</var>.</p></li>
+
+<li><p>If <var>quoteMark</var> is not either 0x22 (") or 0x27 ('), then return failure.</p></li>
+
+<li><p>Advance <var>encodingPosition</var> to the next byte.</p></li>
+
+<li><p>Let <var>encodingEndPosition</var> be the position of the next occurrence of
+<var>quoteMark</var> at or after <var>encodingPosition</var>. If <var>quoteMark</var> does not
+occur again, then return failure.</p></li>
+
+<li><p>Let <var>potentialEncoding</var> be the sequence of the bytes between
+<var>encodingPosition</var> (inclusive) and <var>encodingEndPosition</var> (exclusive).</p></li>
+
+<li><p>If <var>potentialEncoding</var> contains one or more bytes whose byte value is 0x20 or
+below, then return failure.</p></li>
+
+<li><p>Let <var>encoding</var> be the result of <span>getting an encoding</span> given
+<var>potentialEncoding</var> <span data-x="isomorphic decode">isomorphic decoded</span>.</p></li>
+
+<li><p>If the <var>encoding</var> is <span>UTF-16BE/LE</span>, then change it to
+<span>UTF-8</span>.</p></li>
+
+<li><p>Return <var>encoding</var>.</p></li>
+</ol>
+
<p>For the sake of interoperability, user agents should not use a pre-scan algorithm that returns
different results than the one described above. (But, if you do, please at least let us know, so
that we can improve this algorithm and benefit everyone...)</p>
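The "get an XML encoding" steps added in this hunk can be followed in a short Python sketch. It is a reading of the quoted steps under stated assumptions, not the implementation used by any browser:

- The `LABELS` table is a small illustrative subset standing in for the Encoding Standard's full "get an encoding" label table.
- The sketch steps past the `=` byte before skipping whitespace again. The steps as quoted here do not spell that out, so it is an assumption.
- Running off the end of the input is treated as failure, which is also an assumption.
- `get_xml_encoding` and `skip_space_or_control` are hypothetical names.

```python
from typing import Optional

# Illustrative subset of the Encoding Standard's label table (assumption:
# the real "get an encoding" table is much larger).
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "utf-16": "UTF-16LE",
    "utf-16le": "UTF-16LE",
    "utf-16be": "UTF-16BE",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
    "shift_jis": "Shift_JIS",
}


def skip_space_or_control(data: bytes, pos: int) -> int:
    """Advance pos while it points at a byte <= 0x20."""
    while pos < len(data) and data[pos] <= 0x20:
        pos += 1
    return pos


def get_xml_encoding(data: bytes) -> Optional[str]:
    """Return an encoding name, or None to represent failure."""
    if not data.startswith(b"<?xml"):
        return None
    # The quoted steps only require that a '>' byte exists; they do not use
    # its position to bound the later searches, and neither does this sketch.
    if data.find(b">") == -1:
        return None
    pos = data.find(b"encoding")
    if pos == -1:
        return None
    pos += len(b"encoding")  # advance past the 'g' byte
    pos = skip_space_or_control(data, pos)
    if pos >= len(data) or data[pos] != 0x3D:  # '='
        return None
    pos += 1  # assumption: step past '=' (not explicit in the quoted steps)
    pos = skip_space_or_control(data, pos)
    if pos >= len(data) or data[pos] not in (0x22, 0x27):  # '"' or "'"
        return None
    quote_mark = data[pos]
    pos += 1
    end = data.find(bytes([quote_mark]), pos)
    if end == -1:
        return None
    potential = data[pos:end]
    if any(b <= 0x20 for b in potential):
        return None
    # Isomorphic decode maps each byte to the code point of the same value,
    # which latin-1 decoding also does; bytes.lower() only changes ASCII.
    label = potential.lower().decode("latin-1")
    encoding = LABELS.get(label)
    if encoding in ("UTF-16BE", "UTF-16LE"):
        # A declaration readable as ASCII-compatible bytes cannot be UTF-16.
        encoding = "UTF-8"
    return encoding
```

For example, `get_xml_encoding(b'<?xml version="1.0" encoding="windows-1252"?>')` yields `"windows-1252"`, while a declared `UTF-16` label is changed to `"UTF-8"` by the final substitution step.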
@@ -105385,6 +105464,8 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {
<dfn data-x-href="https://encoding.spec.whatwg.org/#iso-2022-jp">ISO-2022-JP</dfn>,
<dfn data-x-href="https://encoding.spec.whatwg.org/#shift_jis">Shift_JIS</dfn>,
<dfn data-x-href="https://encoding.spec.whatwg.org/#euc-kr">EUC-KR</dfn>,
+<dfn data-x-href="https://encoding.spec.whatwg.org/#utf-16be">UTF-16BE</dfn>,
+<dfn data-x-href="https://encoding.spec.whatwg.org/#utf-16le">UTF-16LE</dfn>,
<dfn data-x-href="https://encoding.spec.whatwg.org/#utf-16be-le"
id="utf-16-encoding">UTF-16BE/LE</dfn>, and
<dfn data-x-href="https://encoding.spec.whatwg.org/#x-user-defined">x-user-defined</dfn>.