Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288
base: master
… properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff
From reading the code, this looks good, but I still need to test it somehow to make sure the mappings are correct. By the way, I wonder whether it would be easier to just use a [128] array for every codepage and simplify the code.
Checked the cp1251 logic – looks good. I haven't actually verified it, but the previously supported characters should keep working as they did.
Doing this is definitely an option, and it would simplify the FromUTF8 function as well.
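To make the [128] array idea concrete, here is a minimal Python sketch (not the PR's actual code). It assumes one table per codepage covering bytes 0x80-0xff, builds that table from Python's own `cp1252` codec, and maps undefined bytes to U+FFFD. The names `build_high_table`, `byte_to_code_point` and `code_point_to_byte` are hypothetical, and the U+FFFD and `?` fallbacks are assumptions rather than something the thread specifies.

```python
def build_high_table(codec):
    """Return 128 code points, one for each byte 0x80-0xff of the given codepage."""
    table = []
    for byte in range(0x80, 0x100):
        try:
            table.append(ord(bytes([byte]).decode(codec)))
        except UnicodeDecodeError:
            # Assumption: bytes the codepage leaves undefined become U+FFFD.
            table.append(0xFFFD)
    return table


CP1252_HIGH = build_high_table("cp1252")


def byte_to_code_point(byte, table):
    """Codepage byte -> Unicode code point (bytes below 0x80 are plain ASCII)."""
    return byte if byte < 0x80 else table[byte - 0x80]


def code_point_to_byte(code_point, table, fallback=ord("?")):
    """Unicode code point -> codepage byte, searching the whole table."""
    if code_point < 0x80:
        return code_point
    if code_point != 0xFFFD and code_point in table:
        return 0x80 + table.index(code_point)
    return fallback  # Assumption: unmappable characters become '?'.
```

With a table like this, the per-range special cases disappear: decoding is a single index and the reverse direction is a linear search over 128 entries, which is the simplification to FromUTF8 mentioned above.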
This also makes me wonder a few things:
Also, should I add support in the ToUTF8 function for outputting unicode characters above 0xFFFF (i.e. 4-byte utf-8, for unicode character codes 0x10000 thru 0x10FFFF)? This would allow (in theory) some characters to be rendered as emojis etc.
I'll try to explain it:
A utf-8 character with unicode character code <= 0x7f is sent as a single byte, 0x00-0x7f (0b00000000 thru 0b01111111).
A utf-8 character with unicode character code >= 0x80 and <= 0x7ff is broken into two bytes, sent one after the other: 0b110xxxxx 0b10xxxxxx
A utf-8 character with unicode character code >= 0x800 and <= 0xffff is broken into three bytes, sent one after the other: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
A utf-8 character with unicode character code >= 0x10000 and <= 0x10ffff (in theory this could be as high as 0x1fffff, but unicode defined the max range as 0x10ffff for compatibility with UTF-16) is broken into four bytes, sent one after the other: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
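The byte layout described above can be written out as a short Python sketch. This is an illustration of the standard UTF-8 encoding rules, not the PR's ToUTF8 implementation; the function name `to_utf8` is hypothetical, and rejecting surrogates (0xD800-0xDFFF) is a choice made here to match what Python's built-in encoder does.

```python
def to_utf8(code_point):
    """Encode a single Unicode code point as UTF-8 bytes."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Each continuation byte carries six payload bits, which is why the range boundaries fall at 0x7ff (5 + 6 bits) and 0xffff (4 + 6 + 6 bits).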
No point in adding features that won't be used. As far as I know, none of cp1250, cp1251 and cp1252 have 4-byte UTF8 symbols (do they even have 3-byte symbols?), and we don't support anything else. Same with codepoint overrides: they just add a lot of complexity that I doubt anyone will need. And if someone does, it's a simple, isolated change to modify the mapping table, so they can just maintain it in their fork. I think it makes sense to move the ToUTF8 encoding to be a
Yes, there are several 3-byte symbols already: 0x2026 (horizontal ellipsis), for instance, is what cp1252 0x85 translates to (it is used many times in dialog.tlk), and it is a 3-byte utf-8 character: 0xE2 0x80 0xA6
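One way to check the 3-byte question for all three codepages is to enumerate their upper halves. The Python sketch below relies on Python's built-in `cp1250`, `cp1251` and `cp1252` codecs as a stand-in for the project's own tables, so it assumes the two agree; `wide_symbols` is a hypothetical helper name.

```python
def wide_symbols(codec):
    """Map each byte 0x80-0xff whose code point is >= 0x800 to its UTF-8 encoding."""
    found = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # This byte is undefined in the codepage; skip it.
        if ord(char) >= 0x800:
            found[byte] = char.encode("utf-8")
    return found


for codec in ("cp1250", "cp1251", "cp1252"):
    symbols = wide_symbols(codec)
    longest = max(len(encoded) for encoded in symbols.values())
    print(codec, "bytes needing 3+ UTF-8 bytes:", len(symbols),
          "longest encoding:", longest,
          "0x85 ->", symbols[0x85].hex())
```

Under that assumption, byte 0x85 encodes to `e280a6` in every one of the three codepages, and no entry needs more than three bytes, which supports leaving 4-byte output out of scope.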
…odepages, and fixed FromUTF8 to also search the entirety of these tables.
Fixed cp1252 characters from 0x80-0x9f not being converted to unicode properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff