From daf61a202cd1aa337ccaaa02cefdb725f6ba1555 Mon Sep 17 00:00:00 2001
From: Domenic Denicola
Date: Tue, 6 Sep 2016 21:53:53 -0400
Subject: [PATCH] Add XML declaration encoding sniffing

Closes #1438, where we found out that this is required for web compatibility. The algorithm given here is an exact copy of that used by WebKit and Blink, with the exception that it does not detect UTF-32 byte sequences since in web-standards-world, UTF-32 must not be supported.
---
 source | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 108 insertions(+), 7 deletions(-)

diff --git a/source b/source
index 0c716d83781..9b4e9d38a9d 100644
--- a/source
+++ b/source
@@ -99996,19 +99996,71 @@ dictionary StorageEventInit : EventInit {
 encoding, given some defined end condition, then it must run the following steps. These steps either abort unsuccessfully or return a character encoding. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this
- one) the user agent either runs out of bytes (meaning the position pointer
- created in the first step below goes beyond the end of the byte stream obtained so far) or reaches
- its end condition, then abort the prescan a byte stream to determine its
- encoding algorithm unsuccessfully.

+ get an attribute algorithm invoked by + this one) the user agent either runs out of bytes (meaning the position pointer + created in the second step below goes beyond the end of the byte stream obtained so far) or + reaches its end condition, then if the below fallback encoding variable is + set to a non-null value, abort the prescan a byte stream to determine its encoding + algorithm with fallback encoding as the encoding; otherwise, abort the algorithm + unsuccessfully.

    +
  1. Let fallback encoding be null.

  2. + +
  3. Let position be a pointer to a byte in the input byte stream, initially + pointing at the first byte.

  4. +
  5. +

    Prescan for XML declarations: If position points to:

    + +
    +
    A sequence of bytes starting with: 0x3C, 0x3F, 0x78, 0x6D, 0x6C (case-sensitive ASCII + '<?xml')
    +
    +

    Get an XML encoding. If this + does not return failure, set fallback encoding to the returned encoding, and then + continue with this algorithm.

    +
    + +
    A sequence of bytes starting with: 0x3C, 0x0, 0x3F, 0x0, 0x78, 0x0 (case-sensitive UTF-16 + little-endian '<?x')
    +
    -

    Let position be a pointer to a byte in the input byte stream, initially - pointing at the first byte.

    +

    Abort the prescan a byte stream to determine its encoding algorithm, + returning UTF-16LE.

    + +
    +
    A sequence of bytes starting with: 0x0, 0x3C, 0x0, 0x3F, 0x0, 0x78 (case-sensitive UTF-16 + big-endian '<?x')
    +
    +

    Abort the prescan a byte stream to determine its encoding algorithm, + returning UTF-16BE.

    +
    + + +
    + +

    Prescanning for XML declarations, even in HTML documents, must be done for + compatibility with legacy content. See issue #1438.
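
The three-way check above can be sketched in Python. This is an illustrative sketch only, not the WebKit or Blink code: the function name `classify_xml_prefix` and its return values are hypothetical, and it assumes the ASCII branch matches the five bytes of '<?xml' while the UTF-16 branches match the six bytes listed.

```python
from typing import Optional

# Byte prefixes taken from the switch above.
XML_DECL_ASCII = bytes([0x3C, 0x3F, 0x78, 0x6D, 0x6C])           # '<?xml' (assumed five bytes)
XML_DECL_UTF16LE = bytes([0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00])   # '<?x' in UTF-16 little-endian
XML_DECL_UTF16BE = bytes([0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78])   # '<?x' in UTF-16 big-endian


def classify_xml_prefix(data: bytes) -> Optional[str]:
    """Report which branch of the XML declaration prescan applies, if any.

    Hypothetical helper: returns "get-xml-encoding" when the caller should
    run "get an XML encoding" (and possibly set the fallback encoding),
    "UTF-16LE" or "UTF-16BE" when the prescan aborts with that encoding,
    and None when no branch matches.
    """
    if data.startswith(XML_DECL_ASCII):
        return "get-xml-encoding"
    if data.startswith(XML_DECL_UTF16LE):
        return "UTF-16LE"
    if data.startswith(XML_DECL_UTF16BE):
        return "UTF-16BE"
    return None
```

Only the ASCII branch continues with the rest of the prescan; the two UTF-16 branches end it immediately.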

  6. @@ -100299,6 +100351,55 @@ dictionary StorageEventInit : EventInit {
+

When the prescan a byte stream to determine its encoding algorithm says to get an XML encoding, it means doing + this. If at any point during these steps the encodingPosition pointer created in the + first step below goes beyond the end of the byte stream obtained so far, abort the get an XML encoding algorithm and return + failure.

+ +
    +
  1. Let encodingPosition be a distinct pointer to the same place in the input byte + stream as position.

  2. + +
  3. Let xmlDeclarationEnd be a pointer to the next byte in the input byte + stream which is 0x3E (ASCII '>'). If there is no such byte, abort the get an XML encoding algorithm + and return failure.

  4. + +
  5. Set encodingPosition to the position of the first occurrence of the subsequence + of bytes 0x65, 0x6E, 0x63, 0x6F, 0x64, 0x69, 0x6E, 0x67 (ASCII 'encoding') at or after the + current encodingPosition. If there is no such sequence, abort the get an XML encoding algorithm + and return failure.

  6. + +
  7. Advance encodingPosition past the 0x67 (ASCII 'g') byte.

  8. + +
  9. While the byte at encodingPosition is less than or equal to 0x20 (i.e. it is + either an ASCII space or control character), advance encodingPosition to the next + byte.

  10. + +
  11. If the byte at encodingPosition is not 0x3D (ASCII '='), abort the get an XML encoding algorithm + and return failure.

  12. + +
  13. Let quoteMark be the byte at encodingPosition.

  14. + +
  15. Advance encodingPosition to the next byte.

  16. + +
  17. Let encodingEndPosition be the position of the next occurrence of + quoteMark at or after encodingPosition. If quoteMark does not + occur again, abort the get an XML encoding + algorithm and return failure.

  18. + +
  19. Let potentialEncoding be the Unicode string whose code points are the same as + the values of the bytes between encodingPosition (inclusive) and + encodingEndPosition (exclusive).

  20. + +
  21. Return the result of getting an encoding given + potentialEncoding.

  22. +
+
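
The steps above can be sketched in Python roughly as follows. This is a hedged illustration, not a definitive implementation. The function name `get_xml_encoding` is hypothetical. It returns the raw label rather than performing the Encoding Standard's "getting an encoding" lookup. It also assumes that the pointer advances past the '=' byte, and any following bytes at or below 0x20, before the quote mark is read; the steps as listed do not state this.

```python
from typing import Optional


def get_xml_encoding(data: bytes, position: int = 0) -> Optional[str]:
    """Return the label in an XML declaration's encoding pseudo-attribute.

    Returns None for "failure". Running past the end of `data` also counts
    as failure, mirroring the rule stated before the steps.
    """
    pos = position

    # xmlDeclarationEnd: the next '>' byte must exist.
    if data.find(b">", pos) == -1:
        return None

    # Find 'encoding' at or after the current position, then step past the 'g'.
    pos = data.find(b"encoding", pos)
    if pos == -1:
        return None
    pos += len(b"encoding")

    # Skip ASCII space and control bytes (<= 0x20).
    while pos < len(data) and data[pos] <= 0x20:
        pos += 1
    if pos >= len(data) or data[pos] != 0x3D:  # must be '='
        return None

    # ASSUMPTION (not in the listed steps): step past '=' and skip bytes <= 0x20.
    pos += 1
    while pos < len(data) and data[pos] <= 0x20:
        pos += 1
    if pos >= len(data):
        return None

    # quoteMark is whatever byte comes next; the listed steps do not restrict it.
    quote_mark = data[pos]
    pos += 1

    end = data.find(bytes([quote_mark]), pos)
    if end == -1:
        return None

    # Map each byte to the code point with the same value.
    return data[pos:end].decode("latin-1")
```

For example, `get_xml_encoding(b'<?xml version="1.0" encoding="windows-1252"?>')` yields the label `windows-1252`, which a caller would then pass to the label lookup.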

For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)