Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 #1288

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Lord-Nightmare
Copy link

Fixed cp1252 characters from 0x80-0x9f not being converted to unicode properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff

… properly

Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff
@mtijanic
Copy link
Member

At code read level this looks good, but I still need to test it out somehow to make sure the mappings are good.

By the way, I'm thinking if it is not easier to just do a [128] array for all codepages and simplify the code.

@virusman
Copy link
Member

virusman commented Mar 15, 2021

Checked the cp1251 logic – looks good. I haven't actually verified it, but the previously supported characters should keep working as they did.
I don't really understand all the shifting and other arithmetic with push_back, so can't verify that.

@Lord-Nightmare
Copy link
Author

By the way, I'm thinking if it is not easier to just do a [128] array for all codepages and simplify the code.

Doing this is definitely an option, and would also simplify the FromUTF8 function as well.

@Lord-Nightmare
Copy link
Author

This also makes me wonder a few things:
Should the users have control over these tables, so for instance, if you want a font to use a hair space instead of a space character, you could change an entry in a [256] long conversion table to use U+2009 THIN SPACE or U+200A HAIR SPACE in place of U+0020 SPACE, if it would look better in a specific font.

Also, should I add supoort in the ToUTF8 function for outputting unicode characters above 0xFFFF (i.e. 4 bytes utf-8, for unicode character codes 0x10000 thru 0x10FFFF)? This would allow (in theory) some characters to be rendered as emojis etc.

@Lord-Nightmare
Copy link
Author

I don't really understand all the shifting and other arithmetic with push_back, so can't verify that.

I'll try to explain it:

A utf-8 character with unicode character code <= 0x7f is just sent as 0x00-0x7f (0b00000000 thru 0b01111111)

A utf-8 character with unicode character code >= 0x80 and <= 0x7ff is broken into two bytes, sent one after the other:

  1. 0b110xxxxx with the high 5 bits (bits 6 thru 10 lsb to msb) of the character code replacing the 'x's
  2. 0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

A utf-8 character with unicode character code >= 0x800 and <= 0xffff is broken into three bytes, sent one after the other:

  1. 0b1110xxxx with the high 4 bits (bits 12 thru 15 lsb to msb) of the character code replacing the 'x's
  2. 0b10xxxxxx with the next 6 bits (bits 5 thru 11 lsb to msb) of the character code replacing the 'x's
  3. 0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

A utf-8 character with uncode character code >= 0x10000 and <= 0x10ffff (in theory this could be as high as 0x1fffff but unicode defined the max range as 0x10ffff for compatibility with UTF-16) is broken into four bytes, sent one after the other:

  1. 0b11110xxx with the high 3 bits (bits 18 thru 20 lsb to msb) of the character code replacing the 'x's
  2. 0b10xxxxxx with the next 6 bits (bits 12 thru 17 lsb to msb) of the character code replacing the 'x's
  3. 0b10xxxxxx with the next 6 bits (bits 5 thru 11 lsb to msb) of the character code replacing the 'x's
  4. 0b10xxxxxx with the low 6 bits (bits 0 thru 5 lsb to msb) of the character code replacing the 'x's

@mtijanic
Copy link
Member

No point in adding features that won't be used. As far as I know, cp1250, cp1251 and cp1252 none have 4-byte UTF8 symbols (do they even have 3 byte symbols?) and we don't support anything else.

Same with codepoint overrides, it just adds much complexity that I doubt anyone will need. And if they do, it's a simple isolated change to modify the mapping table, so they can just maintain it in their fork.

I think it makes sense to move the ToUTF8 encoding to be a [128] array everywhere. For FromUTF8, we can probably just dynamically generate another array from the previous one to avoid iterating for every character. I'm not too worried about wasting 64kb on a sparse table. (though it remains to be seen if that is faster that iteration, due to cache effects)

@Lord-Nightmare
Copy link
Author

Lord-Nightmare commented Mar 17, 2021

No point in adding features that won't be used. As far as I know, cp1250, cp1251 and cp1252 none have 4-byte UTF8 symbols (do they even have 3 byte symbols?) and we don't support anything else.

Yes, there are several 3 byte symbols already: 0x2026 (horizontal ellipsis) for instance, is translated to by cp1252 0x85 (which is used many times in dialog.tlk), and is a 3-byte utf-8 character: 0xE2 0x80 0xA6

…odepages, and fixed FromUTF8 to also search the entirety of these tables.
@Lord-Nightmare Lord-Nightmare changed the title Various fixes for converting cp1252, cp1251, cp1250 characters to utf-8 Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8 Apr 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants