Misuse of wchar_t for UTF-16/32 #180

Open · glebov-andrey opened this issue Mar 16, 2019 · 4 comments

@glebov-andrey
Even though wchar_t does not have a well-defined size in C or C++, LittleCMS uses it to store UTF-16 strings.

The problem is that on platforms with a 32-bit wchar_t (e.g. Linux, OS X), each UTF-16 code unit (e.g. read from a file) is simply cast to wchar_t by _cmsReadWCharArray, which does not produce a valid UTF-32 sequence. To produce a valid UTF-32 sequence, one would have to decode each Unicode code point from one or more UTF-16 code units.
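For illustration, here is a minimal sketch of the decoding step that would be needed (the function name and error-handling convention are mine, not part of LittleCMS):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one Unicode code point from a UTF-16 sequence.
   Returns the number of code units consumed (1 or 2),
   or 0 on a malformed sequence. */
static size_t utf16_decode(const uint16_t *src, size_t len, uint32_t *out)
{
    if (len == 0) return 0;
    uint16_t lead = src[0];
    if (lead < 0xD800 || lead > 0xDFFF) {      /* BMP code point */
        *out = lead;
        return 1;
    }
    if (lead <= 0xDBFF && len >= 2) {          /* high surrogate */
        uint16_t trail = src[1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            *out = 0x10000u + (((uint32_t)(lead - 0xD800) << 10)
                             |  (uint32_t)(trail - 0xDC00));
            return 2;                          /* surrogate pair */
        }
    }
    return 0;                                  /* lone/invalid surrogate */
}
```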

To make things worse, _cmsWriteWCharArray just down-casts wchar_t to cmsUInt16Number, discarding the upper 16 bits of the UTF-32 code unit. This, of course, does not convert UTF-32 to UTF-16; the correct conversion would be to decompose each Unicode code point into one or more UTF-16 code units.
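The correct encoding direction would look something like this (again a hypothetical sketch, not proposed API):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes 1 or 2 code units
   into dst; returns the count, or 0 if the code point is not encodable
   (surrogate range or above U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t dst[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are invalid */
        dst[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    dst[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    dst[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```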

Apart from these problems, there is also the minor matter of wasted memory: each 32-bit wchar_t leaves 16 bits unused.

At the very minimum, I would propose documenting the fact that strings created by LittleCMS functions returning wchar_t arrays cannot be used as valid UTF-32 strings on platforms with a 32-bit wchar_t. Instead, the user must do their own conversion, taking the lower 16 bits of each value to form a valid UTF-16 sequence. Functions that take wchar_t strings should likewise be documented as taking UTF-16 sequences in which each code unit is stored in a wchar_t.
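The documented workaround could look like this (a minimal sketch; the helper name is mine, and I use cmsMLUgetWide only as an example of a function that hands back wchar_t data):

```c
#include <stddef.h>
#include <stdint.h>

/* On a platform with 32-bit wchar_t, the values LittleCMS returns
   (e.g. via cmsMLUgetWide) are really UTF-16 code units, one per
   element. Narrow them before passing the string to any API that
   expects genuine UTF-16. */
static void narrow_to_utf16(const wchar_t *src, size_t n, uint16_t *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint16_t)src[i];   /* keep only the low 16 bits */
}
```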

Of course, this behaviour is neither intuitive nor expected by the user, and it makes handling textual data with LittleCMS more complex (not to mention resource-intensive, as each string conversion may require allocating extra memory). For instance, this confusion can cause issues such as #163.

That is why I also propose deprecating the functions that use wchar_t in favour of new functions that operate on a new cmsU16Character type (always 16-bit; the name is definitely up for discussion) and explicitly stating that strings are encoded as UTF-16 in the native byte order. (I would say just use char16_t, but that would require C11/C++11, so a custom type is probably a better choice for compatibility.) The new functions could use a UTF16 postfix; a sketch follows below.
With these new functions the internals would need to be reworked: instead of storing Unicode text in wchar_t arrays, it would be stored in cmsU16Character arrays. As an added bonus, this would allow for more efficient I/O functions that process multiple characters at once instead of one character at a time (with the byte-swapping if-statement executed only once per block of characters).
The deprecated functions would just do the up/down-casting themselves instead of deferring it to the I/O layer.
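A rough sketch of what the new declarations might look like (all names are placeholders, and the signatures merely mirror the existing cmsMLUgetWide/cmsMLUsetWide; nothing here is an agreed design):

```c
#include <stdint.h>
#include "lcms2.h"   /* for cmsMLU, cmsBool, cmsUInt32Number */

/* Placeholder name: always exactly 16 bits, one UTF-16 code unit. */
typedef uint16_t cmsU16Character;

/* UTF16-postfixed counterparts of the existing wchar_t functions,
   operating on UTF-16 in the native byte order: */
cmsUInt32Number cmsMLUgetUTF16(const cmsMLU* mlu,
                               const char LanguageCode[3],
                               const char CountryCode[3],
                               cmsU16Character* Buffer,
                               cmsUInt32Number BufferSize);

cmsBool cmsMLUsetUTF16(cmsMLU* mlu,
                       const char LanguageCode[3],
                       const char CountryCode[3],
                       const cmsU16Character* String);
```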

Another issue I see is with the ASCII functions: the conversion from UTF-16 to ASCII (cmsMLUgetASCII) is only valid if the original string contains no non-ASCII code points. A better way to handle this might be to provide the ability to replace non-ASCII code points with a character of the user's choosing (e.g. ?). Of course, this would require handling strings as actual sequences of code points instead of opaque code units. (I'm not 100% sure, but this probably doesn't even require fully decoding the UTF-32 code point from UTF-16, just skipping over some of the UTF-16 code units.)
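A sketch of what I mean (hypothetical helper, not proposed API): a surrogate pair is one code point, so it yields a single replacement character, and no full decode is needed, only skipping the trail surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 to ASCII, substituting `repl` for anything outside
   the ASCII range. Returns the number of bytes written to dst
   (dst must have room for at least len bytes). */
static size_t utf16_to_ascii(const uint16_t *src, size_t len,
                             char *dst, char repl)
{
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        uint16_t cu = src[i];
        if (cu < 0x80) {
            dst[out++] = (char)cu;
        } else {
            dst[out++] = repl;
            /* High surrogate: the next unit belongs to the same
               code point, so skip it. */
            if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < len &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
                ++i;
        }
    }
    return out;
}
```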

P.S. I would be glad to create a pull request with some of the proposed changes if a decision is made to proceed with them.

@mm2 (Owner) commented Mar 16, 2019 via email

@roelandschoukens

Yes, any use of these functions requires boilerplate to cast those 32-bit characters back to uint16; otherwise no other library or API will read the strings correctly. Not a big deal, but it is pretty unusual.

Not sure how to handle string literals. There is no C99 equivalent of L"" that gives you 16-bit strings; C11 introduces u"" literals (which are arrays of char16_t).
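For example (a minimal C11 snippet; the string contents are arbitrary):

```c
#include <uchar.h>    /* C11: char16_t */
#include <stddef.h>   /* wchar_t */

const char16_t *s16 = u"text";   /* u"" literal: array of char16_t, C11 only */
const wchar_t  *sw  = L"text";   /* L"" literal: element width varies by platform */
```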

“the ASCII functions [are] only valid if the original string has no non-ASCII code points in it” → correct. Don't use them if you are not 100% sure there are only ASCII characters.

@levicki commented Jun 17, 2019

Would it be possible/easier to solve this using UTF-8 on all platforms?

@mm2 (Owner) commented Jan 7, 2022

2.13 has some code to fix this issue. I left it open because it still misses the write part, but the engine should now read and convert the ICC's UTF-16 into UTF-32.
