UTF-8 and declaration of the encoding #520

Closed
heidivanparys opened this issue Nov 26, 2024 · 3 comments

@heidivanparys
Member

I recently came across W3C's Encoding Standard. In 4.2. Names and labels, it specifies:

Authors must use the UTF-8 encoding and must use its (ASCII case-insensitive) "utf-8" label to identify it.

New protocols and formats, as well as existing formats deployed in new contexts, must use the UTF-8 encoding exclusively. If these protocols and formats need to expose the encoding’s name or label, they must expose it as "utf-8".

That subclause is referenced from e.g. the HTML specification, see 4.2.5.4 Specifying the document's character encoding:

The Encoding standard requires use of the UTF-8 character encoding and requires use of the "utf-8" encoding label to identify it. Those requirements necessitate that the document's character encoding declaration, if it exists, specifies an encoding label using an ASCII case-insensitive match for "utf-8". Regardless of whether a character encoding declaration is present or not, the actual character encoding used to encode the document must be UTF-8. [ENCODING]

So the requirement from the Encoding Standard actually overrules the recommendation from the XML standard, 4.3.3 Character Encoding in Entities, which specifies that:

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, [...]

How does this impact TC 211's standards and resources? I guess it would mainly impact the XMG resources (the encoding declaration would have to be <?xml version="1.0" encoding="utf-8"?> instead of <?xml version="1.0" encoding="UTF-8"?>). The standards impacted probably originate mainly from OGC.

@PeterParslow
Contributor

Heidi,
given that both the W3C sources you cite are explicit that it is a case-insensitive label, I see no reason to change from UTF-8 to utf-8 or vice versa.
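
To make the case-insensitivity point concrete, here is a minimal Python sketch. The helper names `is_utf8_label` and `parse_with_declaration` are hypothetical and not taken from any of the cited specifications. The first applies an ASCII case-insensitive match against "utf-8" after stripping ASCII whitespace; the second parses the same document with either spelling in the XML declaration, assuming the standard-library `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

# ASCII whitespace: tab, line feed, form feed, carriage return, space.
ASCII_WHITESPACE = "\t\n\x0c\r "


def is_utf8_label(label: str) -> bool:
    """Hypothetical helper: ASCII case-insensitive match against "utf-8"."""
    stripped = label.strip(ASCII_WHITESPACE)
    # isascii() guards against non-ASCII characters that lower() might fold.
    return stripped.isascii() and stripped.lower() == "utf-8"


def parse_with_declaration(label: str) -> str:
    """Hypothetical helper: parse a small document declared with `label`."""
    doc = f'<?xml version="1.0" encoding="{label}"?><name>Ærø</name>'
    # The bytes are UTF-8 either way; only the declared label differs.
    return ET.fromstring(doc.encode("utf-8")).text


for label in ("UTF-8", "utf-8", "Utf-8"):
    print(label, is_utf8_label(label), parse_with_declaration(label))
```

Under these assumptions, all three spellings are accepted as the same label and yield the same parsed text, which is consistent with leaving existing declarations as they are.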

@ReesePlews
Contributor

@heidivanparys @PeterParslow could you please attach a label to the issue so it can be filtered for easy identification? Thank you.

@heidivanparys
Member Author

I will just close the issue (being also the one who opened it). I think the key part for TC 211 is the phrase “new protocols and formats”. XML is not new, and we don't “deploy” it “in a new context”, so the requirement does not apply, as far as I can see.
