Misuse of wchar_t for UTF-16/32 #180

Open · glebov-andrey opened this issue Mar 16, 2019 · 4 comments

@glebov-andrey
Even though wchar_t does not have a well-defined size in C or C++, LittleCMS uses it to store UTF-16 strings.

The problem is that on platforms with a 32-bit wchar_t (e.g. Linux, OS X), each UTF-16 code unit (e.g. read from a file) is simply cast to wchar_t by _cmsReadWCharArray, which does not produce a valid UTF-32 sequence. To produce a valid UTF-32 sequence, one would have to decode each Unicode code point from one or more UTF-16 code units.
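For illustration, here is a minimal sketch of the decoding step that would be needed (the function name and error-handling convention are mine, not part of LittleCMS):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one Unicode code point from a UTF-16 sequence.
   Returns the number of code units consumed (1 or 2),
   or 0 on a malformed sequence. */
static size_t utf16_decode(const uint16_t *src, size_t len, uint32_t *out)
{
    if (len == 0) return 0;
    uint16_t lead = src[0];
    if (lead < 0xD800 || lead > 0xDFFF) {      /* BMP code point */
        *out = lead;
        return 1;
    }
    if (lead <= 0xDBFF && len >= 2) {          /* high surrogate */
        uint16_t trail = src[1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            *out = 0x10000u + (((uint32_t)(lead - 0xD800) << 10)
                             |  (uint32_t)(trail - 0xDC00));
            return 2;                          /* surrogate pair */
        }
    }
    return 0;                                  /* lone/invalid surrogate */
}
```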

To make things worse, _cmsWriteWCharArray just down-casts wchar_t to cmsUInt16Number, discarding the upper 16 bits of the UTF-32 code unit. This, of course, does not convert UTF-32 to UTF-16; the correct conversion would be to decompose each Unicode code point into one or more UTF-16 code units.
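The correct encoding direction would look something like this (again a hypothetical sketch, not proposed API):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes 1 or 2 code units
   into dst; returns the count, or 0 if the code point is not encodable
   (surrogate range or above U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t dst[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are invalid */
        dst[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    dst[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    dst[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```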

Apart from these problems, there is also the minor matter of wasted memory: each 32-bit wchar_t leaves 16 bits unused.

At the very minimum, I would propose documenting the fact that strings created by LittleCMS functions returning wchar_t arrays cannot be used as valid UTF-32 strings on platforms with a 32-bit wchar_t. Instead, the user must do their own conversion, taking the lower 16 bits of each value to form a valid UTF-16 sequence. Functions that take wchar_t strings should likewise be documented as taking UTF-16 sequences in which each code unit is stored in a wchar_t.
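The documented workaround could look like this (a minimal sketch; the helper name is mine, and I use cmsMLUgetWide only as an example of a function that hands back wchar_t data):

```c
#include <stddef.h>
#include <stdint.h>

/* On a platform with 32-bit wchar_t, the values LittleCMS returns
   (e.g. via cmsMLUgetWide) are really UTF-16 code units, one per
   element. Narrow them before passing the string to any API that
   expects genuine UTF-16. */
static void narrow_to_utf16(const wchar_t *src, size_t n, uint16_t *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint16_t)src[i];   /* keep only the low 16 bits */
}
```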

Of course, this behaviour is neither intuitive nor expected by the user, and it makes handling textual data with LittleCMS more complex (not to mention resource-intensive, as each string conversion may require allocating extra memory). For instance, this confusion can cause issues such as #163.

That is why I also propose deprecating the functions that use wchar_t in favour of new functions that operate on a new cmsU16Character type (always 16-bit; the name is definitely up for discussion) and explicitly stating that strings are encoded as UTF-16 in the native byte order. (I would say just use char16_t, but that would require C11/C++11, so a custom type is probably a better choice for compatibility.) The new functions could use a UTF16 postfix; a sketch follows below.
With these new functions the internals would need to be reworked: instead of storing Unicode text in wchar_t arrays, it would be stored in cmsU16Character arrays. As an added bonus, this would allow for more efficient I/O functions that process multiple characters at once instead of one character at a time (with the byte-swapping if-statement executed only once per block of characters).
The deprecated functions would just do the up/down-casting themselves instead of deferring it to the I/O layer.
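A rough sketch of what the new declarations might look like (all names are placeholders, and the signatures merely mirror the existing cmsMLUgetWide/cmsMLUsetWide; nothing here is an agreed design):

```c
#include <stdint.h>
#include "lcms2.h"   /* for cmsMLU, cmsBool, cmsUInt32Number */

/* Placeholder name: always exactly 16 bits, one UTF-16 code unit. */
typedef uint16_t cmsU16Character;

/* UTF16-postfixed counterparts of the existing wchar_t functions,
   operating on UTF-16 in the native byte order: */
cmsUInt32Number cmsMLUgetUTF16(const cmsMLU* mlu,
                               const char LanguageCode[3],
                               const char CountryCode[3],
                               cmsU16Character* Buffer,
                               cmsUInt32Number BufferSize);

cmsBool cmsMLUsetUTF16(cmsMLU* mlu,
                       const char LanguageCode[3],
                       const char CountryCode[3],
                       const cmsU16Character* String);
```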

Another issue I see is with the ASCII functions: the conversion from UTF-16 to ASCII (cmsMLUgetASCII) is only valid if the original string contains no non-ASCII code points. A better way to handle this might be to provide the ability to replace non-ASCII code points with a character of the user's choosing (e.g. ?). Of course, this would require handling strings as actual sequences of code points instead of opaque code units. (I'm not 100% sure, but this probably doesn't even require fully decoding the UTF-32 code point from UTF-16, just skipping over some of the UTF-16 code units.)
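A sketch of what I mean (hypothetical helper, not proposed API): a surrogate pair is one code point, so it yields a single replacement character, and no full decode is needed, only skipping the trail surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 to ASCII, substituting `repl` for anything outside
   the ASCII range. Returns the number of bytes written to dst
   (dst must have room for at least len bytes). */
static size_t utf16_to_ascii(const uint16_t *src, size_t len,
                             char *dst, char repl)
{
    size_t out = 0;
    for (size_t i = 0; i < len; ++i) {
        uint16_t cu = src[i];
        if (cu < 0x80) {
            dst[out++] = (char)cu;
        } else {
            dst[out++] = repl;
            /* High surrogate: the next unit belongs to the same
               code point, so skip it. */
            if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < len &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
                ++i;
        }
    }
    return out;
}
```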

P.S. I would be glad to create a pull request with some of the proposed changes if a decision is made to proceed with them.

@mm2 (Owner) commented Mar 16, 2019 via email

@roelandschoukens

Yes, any use of these functions requires boilerplate to cast those 32-bit characters back to uint16; otherwise no other library or API will read the strings correctly. Not a big deal, but it is pretty unusual.

Not sure how to handle string literals. There is no C99 equivalent of L"" that gives you 16-bit strings; C11 introduces u"" literals (which are arrays of char16_t).
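For example (a minimal C11 snippet; the string contents are arbitrary):

```c
#include <uchar.h>    /* C11: char16_t */
#include <stddef.h>   /* wchar_t */

const char16_t *s16 = u"text";   /* u"" literal: array of char16_t, C11 only */
const wchar_t  *sw  = L"text";   /* L"" literal: element width varies by platform */
```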

“the ASCII functions [are] only valid if the original string has no non-ASCII code points in it” → correct. Don't use them if you are not 100% sure there are only ASCII characters.

@levicki commented Jun 17, 2019

Would it be possible/easier to solve this using UTF-8 on all platforms?

@mm2 (Owner) commented Jan 7, 2022

2.13 has some code to fix this issue. I left it open because it still misses the write part, but the engine should now read and convert the ICC's UTF-16 into UTF-32.
