Clarify Latin-1 encoding requirements of text chunks #69
Concretely, if someone drops 8859-7 Greek into a So it seems reasonable that an image library says that text handling is left up to the application. libpng will un-compress a To then say that the PNG specification should allow any old 8-bit character encoding, or allow UTF-8 encoded XMP data (which is what your original bug report seems to boil down to) is poor advice. Instead, if we are giving advice at all, it would be to deprecate Latin-1 chunks in favor of the UTF-8 replacements, which allow any text and which perfectly support Latin text as well.
It may be worthy of a separate issue, but should PNG be aligned with the WHATWG Encoding Standard, which makes ISO 8859-1 (and US-ASCII as well) an alias of its superset windows-1252?
@Crissov That would make sense: the primary difference between ISO 8859-1 and windows-1252 is that bytes in the range 0x80 through 0x9F are (rarely used) C1 control characters in the former vs. being assigned characters (such as U+20AC, i.e. EURO SIGN) in the latter. If there are bytes in that range, it is almost always much more useful to interpret them as being windows-1252 as a result.
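To make the difference concrete, here is a minimal Python sketch using the standard `latin-1` and `cp1252` codecs. The helper name `differs_in_c1_range` is hypothetical and only for illustration. Note that Python's `cp1252` codec leaves a few bytes (0x81, for example) undefined and raises on them, whereas the WHATWG table maps those bytes to C1 controls, so this is not a drop-in model of the WHATWG behavior.

```python
# Byte 0x80 is a C1 control character in ISO 8859-1,
# but EURO SIGN in windows-1252.
raw = bytes([0x80])

as_latin1 = raw.decode("latin-1")  # U+0080, a C1 control
as_cp1252 = raw.decode("cp1252")   # U+20AC, EURO SIGN


def differs_in_c1_range(data: bytes) -> bool:
    """Return True if data has any byte in 0x80..0x9F.

    That range is the only place where the two encodings disagree.
    (Hypothetical helper, not part of any PNG library.)
    """
    return any(0x80 <= b <= 0x9F for b in data)
```

Text without bytes in 0x80..0x9F decodes identically under both encodings, so the choice only matters for data that strict Latin-1 would treat as control characters anyway.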
8859-7 Greek has printable characters outside the Latin-1 ranges (between 126 and 161), and most of these character sets share the same subset of non-printable characters which the standard tries to exclude.

This isn't limited to the text field; it concerns the Latin-1 encoding restriction in general, including the keyword field. The wording for the keyword field is more explicit, but libpng does not enforce that either when decoding (link 2 in OP). Both of these are potential sources of incompatibility, e.g. one implementation might discard the chunk while libpng keeps it.

Though I only intended to focus on the text chunks, the same "Latin-1 only" rule also applies to the keyword fields of iCCP and sPLT chunks. Maybe I should shorten the title to "Clarify Latin-1 encoding requirements".

The specification explicitly states the Latin-1 encoding requirement (including the acceptable values) multiple times:
11.3.4.2 Keywords and text strings
11.3.3.3 iCCP Embedded ICC profile
11.3.5.4 sPLT Suggested palette
It's not a stretch to believe text fields are like keyword fields but without the whitespace restrictions, the phrase "restricted to the Latin-1 character set" does leave room for interpretation.
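A strict reading of those sections could be sketched as below. This is an illustration, not the specification's or any library's code: the function names are hypothetical, and the byte ranges (32-126 and 161-255, plus linefeed 10 in text, keywords of 1 to 79 bytes with no leading, trailing, or consecutive spaces) reflect my reading of the keyword and text string rules and should be checked against the spec.

```python
def is_printable_latin1(b: int) -> bool:
    # Assumed printable ranges: 32-126 and 161-255 (decimal).
    return 32 <= b <= 126 or 161 <= b <= 255


def keyword_is_valid(keyword: bytes) -> bool:
    """Strict keyword check (hypothetical helper)."""
    if not 1 <= len(keyword) <= 79:
        return False
    if not all(is_printable_latin1(b) for b in keyword):
        return False
    if keyword.startswith(b" ") or keyword.endswith(b" "):
        return False
    # Consecutive spaces are assumed to be disallowed.
    return b"  " not in keyword


def text_is_valid(text: bytes) -> bool:
    """Text fields: printable Latin-1 plus linefeed (hypothetical helper)."""
    return all(is_printable_latin1(b) or b == 10 for b in text)
```

Even this strict check cannot detect every non-Latin-1 payload: the UTF-8 encoding of "é" is the bytes 0xC3 0xA9, both of which are printable Latin-1, so `text_is_valid` accepts it, while the UTF-8 encoding of "€" contains 0x82 and is rejected.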
There is no distinction between library and application in the standard; that is purely a libpng concept. I never considered any part of file validation as optional or up to the application, because the standard does not define anything of that sort. The whole point of libraries is to abstract away these details. That bug report (randy408/libspng#123) is the perfect example of why libpng's approach is wrong: it allows UTF-8 text to be encoded in a Latin-1-only zTXt chunk, and now everyone else has to deal with it.

It is an interpretation which unfortunately became the de facto standard, and it is no longer practical to interpret it any other way. The logical conclusion is that a library should not discard a non-Latin-1 tEXt/zTXt chunk, because any other approach leads to incompatible behavior between libraries. That's why it should be clarified.
Only because libpng always allowed it and that interpretation became the de facto standard. Creating a conformant and compatible PNG implementation shouldn't require reading libpng source code; the specification alone should be enough for that.
@randy408 One way to interpret a Latin-1 restriction is that it is a bit-bucket not in any particular encoding (which would be Very Different from [and in conflict with] my comment just above about using I would be unsurprised if libspng handled "Latin-1" fields in this way--not really enforcing Latin-1 but not providing any character encoding support either. In such a situation, the code wouldn't look at the bytes. It would still be a good idea to have a health warning, such as suggested by @svgeesus, since undifferentiated encodings are a recipe for mojibake.
Although I agree with the thrust of your comment... this is not exactly correct. 8859-7 Greek has printable characters that use the same byte values as Latin-1 does. You cannot tell if a string is 8859-7 or 8859-1 except by some heuristic ("trying to read the results" as being in some language). The mapping of a Greek 8859 encoding to Unicode certainly includes code points outside of Latin-1, but this is not the same thing.

It should be noted that UTF-8 (and some other non-Latin-1 encodings) does use the range of bytes between 0x80 and 0x9F to encode characters or portions thereof. A strict Latin-1 interpretation would not allow passing UTF-8 or other encodings that use these byte values through the zTXt (or other) fields. In practice, all ASCII-based encodings are the same up through 0x7F, making character encoding agnosticism via Latin-1 possible, provided that the C1 range is just ignored.
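Both points can be checked with Python's standard codecs. This is only a small illustration; the variable names are arbitrary.

```python
# The same byte is a valid printable character in both encodings,
# so the byte alone cannot reveal which encoding was intended.
raw = bytes([0xE1])
latin1 = raw.decode("iso8859_1")  # LATIN SMALL LETTER A WITH ACUTE
greek = raw.decode("iso8859_7")   # GREEK SMALL LETTER ALPHA

# UTF-8 uses bytes in the C1 range (0x80..0x9F) as parts of characters.
euro_utf8 = "\u20ac".encode("utf-8")  # EURO SIGN
c1_bytes = [b for b in euro_utf8 if 0x80 <= b <= 0x9F]
```

Here `euro_utf8` is the three bytes 0xE2 0x82 0xAC, and the middle byte 0x82 falls in the C1 range that a strict Latin-1 reading would reject.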
I don't think it is strictly necessary for a library to error on every error-like condition. That violates the idea of being strict with your output but flexible with your input. Especially given the chaotic nature of the web.

FWIW, in the Jul 11th, 2022 meeting we decided against allowing a UTF BOM in the Latin-1 fields. If there is need for UTF-8 in other areas, we should follow the iTXt example. More info in #136
Looking at the bug report on libspng, randy408/libspng#123, I see the callout that libpng does not validate its input. But I don't think there is a great way to validate. Each value 0-255 has a meaning in Latin-1. For example, a UTF-8 BOM (bytes 0xEF 0xBB 0xBF) could also be the Latin-1 characters ï»¿. What if the user really intended to write those chars? We shouldn't just tell them "You aren't allowed to start with those characters because it could be interpreted as something different, even though the spec clearly says to interpret as Latin-1."

I think we can close this bug. If someone puts non-Latin-1 in those fields and it doesn't display correctly, that is on them. (And again, if someone wants UTF-8 we should add new chunks for that.)
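The ambiguity is easy to reproduce. In the sketch below, `starts_with_utf8_bom` is a hypothetical heuristic, not something libpng or libspng provides. It shows that a BOM check cannot tell a UTF-8 payload from Latin-1 text that legitimately begins with those three characters.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Read as Latin-1, the same bytes are three ordinary printable characters.
as_latin1 = bom.decode("latin-1")


def starts_with_utf8_bom(text_field: bytes) -> bool:
    """Heuristic only: True if the field begins with EF BB BF."""
    return text_field.startswith(codecs.BOM_UTF8)


# A genuine Latin-1 string starting with those characters
# triggers the heuristic too.
genuine_latin1 = "\u00ef\u00bb\u00bfhello".encode("latin-1")
false_positive = starts_with_utf8_bom(genuine_latin1)
```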
The text fields of tEXt and zTXt chunks are supposed to be Latin-1, and UTF-8 text is supposed to go in iTXt chunks, which implies that UTF-8 is not allowed in tEXt and zTXt.
11.3.4.3 tEXt Textual data
The reference implementation never verifies the text fields of these chunks (1 2 3). This can lead to compatibility issues when another implementation treats non-Latin-1 text as invalid when writing or reading these chunks (randy408/libspng#123).
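One way to avoid that incompatibility is for a reader to keep the chunk and report a warning instead of rejecting it. The following is a minimal sketch of that lenient policy, not libpng's or libspng's actual behavior. `TextChunk` and `parse_text_chunk` are hypothetical names, the input is assumed to be the raw tEXt chunk data (keyword, null separator, text), and the "suspicious" byte ranges are my reading of the printable Latin-1 rule.

```python
from typing import NamedTuple


class TextChunk(NamedTuple):
    keyword: str
    text: str
    suspicious: bool  # True if any byte falls outside the assumed ranges


def parse_text_chunk(data: bytes) -> TextChunk:
    """Lenient tEXt parse: keep the data, flag it, never discard it."""
    keyword, sep, text = data.partition(b"\x00")
    if not sep:
        raise ValueError("tEXt data has no null separator")
    # Flag control bytes other than linefeed, plus the 127..160 range.
    suspicious = any(
        (b < 32 and b != 10) or (127 <= b <= 160) for b in keyword + text
    )
    # Latin-1 decoding never fails: every byte maps to a code point.
    return TextChunk(keyword.decode("latin-1"), text.decode("latin-1"), suspicious)
```

With this approach a chunk carrying UTF-8 is still readable by the application, which can decide whether to re-decode the bytes, while a strict writer could refuse to emit anything that would set the flag.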
If Latin-1 encoding is not enforced in the real world, then the standard should clearly state that it is only a recommendation.