Clarify Latin-1 encoding requirements of text chunks #69

Closed
randy408 opened this issue Jan 11, 2022 · 7 comments
Labels
blocking-3rd-edition-wd i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. needs resolved discussion

Comments

@randy408
randy408 commented Jan 11, 2022

The text fields of tEXt and zTXt chunks are supposed to be Latin-1, and UTF-8 text is supposed to go in iTXt chunks, which implies that UTF-8 is not allowed in tEXt or zTXt.

11.3.4.3 tEXt Textual data

... Text is interpreted according to the Latin-1 character set [ISO-8859-1]. The text string may contain any Latin-1 character. Newlines in the text string should be represented by a single linefeed character (decimal 10). Characters other than those defined in Latin-1 plus the linefeed character have no defined meaning in tEXt chunks. Text containing characters outside the repertoire of ISO/IEC 8859-1 should be encoded using the iTXt chunk.

The reference implementation never verifies the text fields of these chunks (1 2 3). This can lead to compatibility issues when another implementation treats non-Latin-1 text as invalid when writing or reading these chunks (randy408/libspng#123).

If Latin-1 encoding is not enforced in the real world, then the standard should clearly state that it's only a recommendation.

@svgeesus svgeesus added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. needs resolved discussion labels Jan 17, 2022
@svgeesus
Contributor

Concretely, if someone drops 8859-7 Greek into a tEXt chunk instead of 8859-1 Western Latin then there is little that libpng could do to detect this.

So it seems reasonable that an image library says that text handling is left up to the application. libpng will un-compress a xTXt then hand it over.

To then say that the PNG specification should allow any old 8-bit character encoding, or allow UTF-8 encoded XMP data (which is what your original bug report seems to boil down to) is poor advice.

Instead, if we are giving advice at all, it would be to deprecate Latin-1 chunks in favor of the UTF-8 replacements, which allow any text and which perfectly support Latin text as well.

The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.
https://encoding.spec.whatwg.org/

@Crissov

Crissov commented Jan 18, 2022

It may be worthy of a separate issue, but should PNG be aligned with the WHATWG Encoding standard, which makes ISO 8859-1 (and US ASCII as well) an alias of its superset Windows-1252?

@aphillips

@Crissov That would make sense: the primary difference between ISO 8859-1 and windows-1252 is that bytes in the range 0x80 through 0x9F are (rarely used) C1 control characters in the former vs. being assigned characters (such as U+20AC, i.e. EURO SIGN) in the latter. If there are bytes in that range, it is almost always much more useful to interpret them as being windows-1252 as a result.
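
To make the difference concrete, here is a minimal Python sketch that decodes the same three bytes with both encodings. It relies on Python's built-in `latin-1` and `cp1252` codecs. One caveat: Python's `cp1252` leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, whereas the WHATWG mapping assigns them to C1 controls, so the sample bytes were picked to avoid those.

```python
# Three bytes from the 0x80-0x9F range discussed above.
data = bytes([0x80, 0x93, 0x94])

# ISO 8859-1 maps each byte straight to the same code point (C1 controls here).
as_latin1 = data.decode("latin-1")

# windows-1252 assigns printable characters to these byte values.
as_cp1252 = data.decode("cp1252")

print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+0080', 'U+0093', 'U+0094']
print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+20AC', 'U+201C', 'U+201D']
```

Bytes outside 0x80-0x9F decode identically under both codecs, which is why treating one as an alias of the other only changes the result for this range.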

@randy408
Author

randy408 commented Jan 18, 2022

Concretely, if someone drops 8859-7 Greek into a tEXt chunk instead of 8859-1 Western Latin then there is little that libpng could do to detect this.

8859-7 Greek has printable characters outside the Latin-1 ranges (between 126 and 161), most of these character sets share the same subset of non-printable characters which the standard tries to exclude.

This isn't limited to the text field but applies to the Latin-1 encoding restriction in general, including the keyword field. The wording for that is more explicit, but libpng does not enforce it either when decoding (link 2 in OP). Both of these are potential sources of incompatibility, e.g. one implementation might discard the chunk while libpng keeps it.

Though I only intended to focus on the text chunks, the same "Latin-1 only" rule also applies to the keyword fields of iCCP and sPLT chunks. Maybe I should shorten the title to "Clarify Latin-1 encoding requirements".

The specification explicitly states the Latin-1 encoding requirement (including the acceptable values) multiple times:

11.3.4.2 Keywords and text strings

Keywords shall contain only printable Latin-1 [ISO-8859-1] characters and spaces; that is, only character codes 32-126 and 161-255 decimal are allowed.
...
In the tEXt and zTXt chunks, the text string associated with a keyword is restricted to the Latin-1 character set plus the linefeed character.
...
The iTXt chunk can be used to convey characters outside the Latin-1 set. It uses the UTF-8 encoding of UCS [ISO/IEC 10646-1].

11.3.3.3 iCCP Embedded ICC profile

... Profile names shall contain only printable Latin-1 characters and spaces (only character codes 32-126 and 161-255 decimal are allowed). Leading, trailing, and consecutive spaces are not permitted

11.3.5.4 sPLT Suggested palette

... Palette names shall contain only printable Latin-1 characters and spaces (only character codes 32-126 and 161-255 decimal are allowed). Leading, trailing, and consecutive spaces are not permitted.
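
As an illustration only, the rules quoted above for names can be sketched as a small byte-level check in Python. The function name `is_valid_png_name` is hypothetical and is not taken from libpng or libspng. Rejecting empty names is an assumption, and no maximum length is checked because the quoted passages do not state one.

```python
def is_valid_png_name(name: bytes) -> bool:
    """Check a keyword, profile name, or palette name against the quoted rules."""
    # Assumption: an empty name is treated as invalid.
    if not name:
        return False
    # Only character codes 32-126 and 161-255 decimal are allowed.
    if any(not (32 <= b <= 126 or 161 <= b <= 255) for b in name):
        return False
    # Leading, trailing, and consecutive spaces are not permitted.
    if name[0] == 32 or name[-1] == 32 or b"  " in name:
        return False
    return True


print(is_valid_png_name(b"Title"))                  # True
print(is_valid_png_name(b" Title"))                 # False (leading space)
print(is_valid_png_name("€".encode("utf-8")))       # False (byte 0x82 is excluded)
```

A check like this only looks at byte values. Some UTF-8 sequences, such as the two bytes for "é" (0xC3 0xA9), fall entirely inside the allowed ranges and would pass.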

It's not a stretch to believe text fields are like keyword fields but without the whitespace restrictions; the phrase "restricted to the Latin-1 character set" does leave room for interpretation.

So it seems reasonable that an image library says that text handling is left up to the application. libpng will un-compress a xTXt then hand it over.

There is no distinction between library and application in the standard; that is purely a libpng concept. I never considered any part of file validation as optional or up to the application, because the standard does not define anything of that sort. The whole point of libraries is to abstract away these details. That bug report (randy408/libspng#123) is the perfect example of why libpng's approach is wrong: it allows UTF-8 text to be encoded in a Latin-1-only zTXt chunk, and now everyone else has to deal with it.

It is an interpretation which unfortunately became the de facto standard, so it is no longer practical to interpret it any other way. The logical conclusion is that a library should not discard a non-Latin-1 tEXt/zTXt chunk, because any other approach leads to incompatible behavior between libraries. That's why it should be clarified.

To then say that the PNG specification should allow any old 8-bit character encoding, or allow UTF-8 encoded XMP data (which is what your original bug report seems to boil down to) is poor advice.

Only because libpng always allowed it and that interpretation became the de facto standard. Creating a conformant and compatible PNG implementation shouldn't require reading libpng source code; the specification alone should be enough.

@aphillips

@randy408 One way to interpret a Latin-1 restriction is that it is a bit-bucket not in any particular encoding (which would be Very Different from [and in conflict with] my comment just above about using windows-1252). In the past (like 20 years ago when Unicode was less widely supported), many specifications were "internationalized" by allowing users who needed a different encoding (including UTF-8) to pass that encoding through "Latin-1" encoded fields but applying whatever encoding they wanted on the receiving end.

I would be unsurprised if libspng handled "Latin-1" fields in this way--not really enforcing Latin-1 but not providing any character encoding support either. In such a situation, the code wouldn't look at the bytes. It would still be a good idea to have a health warning, such as suggested by @svgeesus, since undifferentiated encodings are a recipe for mojibake.

8859-7 Greek has printable characters outside the Latin-1 ranges (between 126 and 161), most of these character sets share the same subset of non-printable characters which the standard tries to exclude.

Although I agree with the thrust of your comment... this is not exactly correct. 8859-7 Greek has printable characters that use the same byte values as Latin-1 does. You cannot tell if a string is 8859-7 or 8859-1 except by some heuristic ("trying to read the results" as being in some language). The mapping of a Greek 8859 encoding to Unicode certainly includes code points outside of Latin-1, but this is not the same thing.
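
A quick Python sketch of that point, using the built-in `iso8859_7` and `latin-1` codecs: the same three bytes are valid, printable text under both encodings, and nothing in the bytes themselves says which one was intended.

```python
# The same three bytes, read two ways.
data = bytes([0xE1, 0xE2, 0xE3])

greek = data.decode("iso8859_7")  # αβγ
latin = data.decode("latin-1")    # áâã

print(greek, latin)
```

Both decodes succeed without error, so a byte-level validator has no basis for rejecting either reading.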

It should be noted that UTF-8 (and some other non-Latin-1 encodings) do use the range of bytes between 0x80 and 0x9F to encode characters or portions thereof. A strict Latin-1 interpretation would not allow passing UTF-8 or other encodings that use these byte values through the zTXt (or other) fields. In practice, all ASCII-based encodings are the same up through 0x7F, making character encoding agnosticism via Latin-1 possible, provided that the C1 range is just ignored.
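
For illustration, a strict check for bytes in the 0x80-0x9F range might look like the following Python sketch. The helper name `has_c1_bytes` is hypothetical and does not come from any PNG library.

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in the 0x80-0x9F (C1) range."""
    return any(0x80 <= b <= 0x9F for b in data)


utf8_euro = "€".encode("utf-8")          # b'\xe2\x82\xac'
latin1_cafe = "café".encode("latin-1")   # b'caf\xe9'
utf8_e_acute = "é".encode("utf-8")       # b'\xc3\xa9'

print(has_c1_bytes(utf8_euro))     # True: 0x82 is in the C1 range
print(has_c1_bytes(latin1_cafe))   # False
print(has_c1_bytes(utf8_e_acute))  # False: this UTF-8 sequence avoids the C1 range
```

The last case shows that such a check is not a UTF-8 detector. It rejects only those UTF-8 sequences that happen to contain a byte in the C1 range.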

@ProgramMax
Collaborator

I don't think it is strictly necessary for a library to error on every error-like condition. That violates the idea of being strict with your output but flexible with your input. Especially given the chaotic nature of the web.

FWIW, in the Jul 11th, 2022 meeting we decided against allowing a UTF BOM in the Latin-1 fields. If there is need for UTF-8 in other areas, we should follow the iTXt example. More info in #136

@ProgramMax
Collaborator

Looking at the bug report on libspng, randy408/libspng#123, I see the callout that libpng does not validate its input. But I don't think there is a great way to validate.

Each value 0-255 has a meaning in Latin-1. For example, a UTF-8 BOM could also be the Latin-1 ï»¿. What if the user really intended to write those chars? We shouldn't just tell them "You aren't allowed to start with those characters because it could be interpreted as something different, even though the spec clearly says to interpret as Latin-1."
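
The ambiguity can be shown with a few lines of Python, using `codecs.BOM_UTF8` from the standard library: the three BOM bytes decode cleanly as three ordinary printable Latin-1 characters.

```python
import codecs

bom = codecs.BOM_UTF8              # b'\xef\xbb\xbf'
as_latin1 = bom.decode("latin-1")  # three printable Latin-1 characters

print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00EF', 'U+00BB', 'U+00BF']
```

All three byte values (239, 187, 191) also fall in the 161-255 range, so a byte-range check alone cannot tell a BOM apart from intentional text.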

I think we can close this bug. If someone puts non-Latin-1 in those fields and it doesn't display correctly, that is on them. (And again, if someone wants UTF-8 we should add new chunks for that.)


6 participants