Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure WM_CHAR functions properly when _UNICODE is defined. #260

Conversation

clangen
Copy link
Contributor

@clangen clangen commented Dec 27, 2022

Note: I think that this patch in its current form may not be something you'd want to merge; it may not even be something you want to merge at all, but I wanted to bring it up...

My app that uses PDCursesMod is compiled with the Use Unicode Character Set option with MSVC++, which will ensure all Win32 API calls use the wide-character variants, all of which use UTF16 encoded characters and strings regardless of the user's selected locale.

That means that when _UNICODE is defined the character sent to the WM_CHAR message is UTF16 encoded -- and PDCursesMod adds it to key queue directly via add_key_to_queue((int) wParam); My app defines PDC_FORCE_UTF8, so I was assuming that internally this character would be converted to a sequence of UTF8 bytes to be received by getch() calls.

However, this was not the case, and parsing some non-ascii characters was problematic. To work around this, I added a patch that converts the UTF16 character received by WM_CHAR into a sequence of UTF8 bytes, and adds them to the key queue one at a time.

this case, we'll convert them to a UTF8 byte sequence and enqueue the
characters individually.
@Bill-Gray
Copy link
Owner

Hmmm... I think I see what you mean (took me a little while; I was slow on the uptake here.) Hit, say, € (Euro sign), and getch() should expand that into a three-byte UTF8 sequence (and that is what ncurses does). get_wch() should just return the Unicode point (0x20AC). (Or, on benighted systems such as Microsoft Windows, whatever code point corresponds to € in the current character set.)

Not entirely sure how to handle this. One possibility would be to have some logic along the following lines:

Let's say you call getch(), and the return value would be non-ASCII and not a "special key" (i.e., it's less than KEY_MIN or greater than KEY_MAX). For the example € = U+20AC, it should be expanded into the three-byte sequence E2, 82, AC. (Or, on MSWin, whatever WideCharToMultiByte() gives you... should be the above sequence if you're currently set to UTF8, or if PDC_FORCE_UTF8=Y is specified.)

So we should remove the U+20AC from the front of the queue and replace it with 82 and AC, and return E2. The next two calls to getch() will retrieve the 82 and the AC, and it'll be up to you, the caller, to take the three bytes and reassemble them into U+20AC.

One small catch : those next two calls may say, "huh, 82 (or AC) is non-ASCII; let's UTF8-ize them." Shouldn't be too tough to keep track of how many decoded bytes are at the start of the queue, though.

That's the product of a few minute's thinking after finally realizing what you were trying to fix here, though. I reserve the right to disavow/revise the above upon further thinking (or having my errors pointed out to me.)

@GitMensch
Copy link
Collaborator

CI fails with

C:\projects\pdcursesmod\wingui\pdcscrn.c(1922): Error! E1063: Missing operand
C:\projects\pdcursesmod\wingui\pdcscrn.c(1947): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1948): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1956): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1959): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1972): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1983): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(1984): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(2009): Error! E1001: CASE must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(2016): Error! E1003: DEFAULT must appear in switch statement
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1009: Expecting ')' but found 'void'
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1026: Invalid declarator
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1009: Expecting ';' but found '0'
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1061: Expecting data or function declaration, but found 'constant'
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1026: Invalid declarator
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1009: Expecting ';' but found ')'
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1061: Expecting data or function declaration, but found ')'
C:\projects\pdcursesmod\wingui\pdcscrn.c(2020): Error! E1026: Invalid declarator
C:\projects\pdcursesmod\wingui\pdcscrn.c(2021): Error! E1009: Expecting ')' but found '('
C:\projects\pdcursesmod\wingui\pdcscrn.c(2021): Error! E1009: Expecting ')' but found '('
C:\projects\pdcursesmod\wingui\pdcscrn.c(2021): Error! E1147: Too many errors: compilation aborted
Error: Compiler returned a bad status compiling "C:\projects\pdcursesmod\wingui\pdcscrn.c"

So you may want to add somewhere:

#ifndef _UNICODE
#define _UNICODE 0
#endif

or check for its definition (whatever is done otherwise)

@Bill-Gray
Copy link
Owner

Before going too deeply into fixing this patch, I'd warn that I don't expect to use it.

It's not quite finished yet, but this patch is (intended to) fix the problem on all platforms, not just WinCon or WinGUI. The idea is that the various platforms continue to queue up the "actual" characters, and wget_wch() returns those; wgetch() returns multi-byte sequences. No platform-specific code is involved.

It works correctly at present in wide-character builds; I need to shuffle things slightly for 8-bit character builds. Shouldn't be difficult; I've just been short on time lately... it'll also mean that PDC_wcstombs() will have to be compiled for 8-character builds; at present, it's a wide-character function only.

@clangen
Copy link
Contributor Author

clangen commented Jan 7, 2023

Ah, yeah, I sorta forgot about WinCon. Given that my patch is incomplete and there is another upstream patch in progress, I think we can close this ticket.

@okibcn
Copy link
Contributor

okibcn commented Jan 19, 2023

I am facing something similar. I am using Ubuntu to cross-compile for windows using mingw32. Duplicating the issue is very easy. Creating a minimum setup for this is easy:

  • 1 Setup the environment with these packages:
sudo -E apt -qq update && sudo apt upgrade -y
sudo -E apt -qq install -y autoconf automake autopoint gcc mingw-w64 gettext git groff make pkg-config texinfo p7zip-full
  • 2 Use the latest PDCurserMod from git
git clone https://github.com/Bill-Gray/PDCursesMod.git
cd PDCursesMod/wingui
make WIDE=Y UTF8=Y _w64=Y demos
  • 3 create a file with the name utf8.c in that folder with this content:
#include <curses.h>

int     quit;
wint_t  c;
wchar_t wbuff[2];

int main()
{
    initscr();
    noecho();
    keypad( stdscr, TRUE);
    while( !quit)
    {
        get_wch(&c);
        if( c == KEY_F(1) || c == 27){
            quit = 1;
        }else{
            wbuff[0] = c;
            wbuff[1] = '\0';
            mvaddwstr( 0,1, wbuff);
            mvprintw( 1, 1, "U+%08x ", c);
            refresh();
        }
    }
    endwin();
    return 0;
}
  • 4 Build the program with:
x86_64-w64-mingw32-gcc -Wall -Wextra -pedantic -O3 -I.. -DPDC_FORCE_UTF8 -mwindows -outf8.exe ./utf8.c pdcurses.a -lgdi32 -lcomdlg32 -lwinmm
  • 5 Copy the utf8.exe file to a Windows system. In my case, I am testing in Windows 11.

Just type any char and the program will reprint it and write the unicode. It works up to a certain point. If you introduce an emoji like this 😀 with unicode U+0001F600 — using an unicode above U+0000FFFF — PDCursesMod fails. It fails to detect the code, and obviously, it can´t print it properly. An easy way to type emojis directly in Windows 11 is by pressing 🪟+/ shortcut.

With a more extended code you can see that PDCursesMod is actually delivering two codes in the buffer 0xd83d 0xde00 corresponding to the UTF-16 encoding of U+0001F600.

I hope this clarifies the issue with the PDCursesMod support of UTF8 in Windows. I have tested the same behavior using the wincon flavor of PDCursesMod.

@okibcn
Copy link
Contributor

okibcn commented Jan 19, 2023

Probably this code can illustrate better the problem with wide char management in PDCursesMod.

Just use the steps mentioned in my previous comment. but in step 3 use this code:

#include <curses.h>

int     i=0, quit=0;
wint_t  c, wbuff[3];
unsigned char *address;

int main()
{
    initscr();
    noecho();
    keypad( stdscr, TRUE);
    wbuff[1] = ' ';
    while( !quit)
    {
        get_wch(&c);
        if( c == KEY_F(1) || c == 27){
            quit = 1;
        }else{
            wbuff[0]=wbuff[1];
            wbuff[1] = c;
            wbuff[2] = '\0';
            mvaddwstr( 0,1, wbuff);
            mvprintw( 1, 1, "U+%08x U+%08x ", wbuff[0], wbuff[1]);
            mvprintw( 3, 1, "Size of wchar_t is %d bytes", sizeof(wchar_t));
            mvprintw( 4, 1, "Size of Wint_t is %d bytes", sizeof(wint_t));
            mvprintw( 5, 1, "Size of int is %d bytes", sizeof(int));
            mvprintw( 7, 1, "Memory:");
            address=&wbuff[0];
            for (i = 0; i < 3*sizeof(wint_t); i++){
                mvprintw( 8, 1+3*i, "%.2x ", address[i]);
            }
            refresh();
        }
    }
    endwin();
    return 0;
}

If you use a font capable of displaying emojis such as Consolas you can see the problems:

  1. wchar_t can't store any unicode because it is too narrow, it should have at least 20 bits to be able to store any unicode. a 32bits int would be fine and it could even store encoded unicode.
  2. PDCurses is returning and storing the encoded unicode. But in windows it is using UTF-16 to encode unicode in memory.

If you type the emoji using the Windows unicode tool (🪟+/) and clicking on the 'grinning face' emoji 😀, you can see how it prints the emoji, but then it prints the next wchar without removing it from the buffer along with the first one.

image

Read and write chars are not consistent. affecting how strings are stored.

@GitMensch
Copy link
Collaborator

It's not quite finished yet, but this patch is (intended to) fix the problem on all platforms, not just WinCon or WinGUI.

@okibcn Did you gave that changed version a try? In any case: thanks for the additional test code and scenario.

@okibcn
Copy link
Contributor

okibcn commented Jan 19, 2023

It's not quite finished yet, but this patch is (intended to) fix the problem on all platforms, not just WinCon or WinGUI.

@okibcn Did you gave that changed version a try? In any case: thanks for the additional test code and scenario.

@GitMensch I am afraid that the code still fails. I am not attaching another screenshot as it is exactly the same as before. The problem is wider as it affects not only getch. The wide string printing functions, and the way Unicode chars are stored in memory, are affected too. I think that it is a deeper problem.

@okibcn
Copy link
Contributor

okibcn commented Jan 22, 2023

@GitMensch, I want to come back to the UTF-8 entry support for all the PDCursesMod flavors. The patch you mentioned has two main problems:

  1. works partially, but not in all cases. It also doesn't consider multiunit UTF-16 inputs, losing any support for emojis or any other codepoint beyond 64K.
  2. It has a dangerous recursive function call in line 758.

I have got very good results replacing the function in that patch by this one:

/* deliver a byte sequence in UTF-8 or UTF-16 encoding.*/
/* Supports supplentary unicode planes up to 20 bits (UTF-16). */
int wgetch(WINDOW *win)
{
    static union
    {
        wchar_t word[2];
        char byte[4];
    } buff;
    static size_t n_buff = 0;
    int rval = 0;

    if (!n_buff) /* Empty buffer, no pending bytes */
    {
        rval = raw_wgetch(win);
        do
        {
            if (rval < 0)
                rval = ERR;
            if(  rval == ERR || (rval < 0x80) || PDC_is_function_key(rval))
                return rval;
            buff.word[0] = rval;
            if ( (rval & 0xF800) != 0xD800) /* single-unit UTF-16 unicode codepoint */
                n_buff = 2;
            else     /* Potential multiunit UTF-16 code */
            {
                rval = raw_wgetch(win);             /* get another UTF-16 unit from buffer*/
                if ( (rval & 0xFC00) != 0xDC00)     /* mask check of second UTF-16 unit */
                {
                    unget_wch((wchar_t)rval);
                    return ERR;
                }                
                buff.word[1]=rval;
                n_buff = 4;
                rval = (((int)(buff.word[0] & 0x3FF) << 10) | (rval & 0x3FF)) + 0x10000; /* rval is the full unicode */
            }
        } while (!n_buff);
#ifdef PDC_FORCE_UTF8
        n_buff = PDC_wc_to_utf8(buff.byte, rval);  /* rval has the raw unicode codepoint */
#endif
    }
    size_t i;
    rval = buff.byte[0] & 0xFF;
    n_buff--;
    for (i = 0; i < n_buff; i++)
        buff.byte[i] = buff.byte[i + 1];
    return (rval);
}

int get_wch(wint_t *wch)
{
    PDC_LOG(("get_wch() - called\n"));

    return wget_wch(stdscr, wch);
}

Obviously, you will need to add the prototype int PDC_wc_to_utf8(char *dest, const int32_t code); before that function for testing,. In any case, I miss that prototype in curspriv.h.

This solves the UTF-8 input for all the PDCursesMod flavors, but there is still another problem. PDCursesMod is not capable to print all the supplemental pages beyond unicode codepoint 64K. Basically you need both to properly use it.

@Bill-Gray and @GitMensch, I am modifying other functions to desabiguate wc (widechar) from UC (unicode). In your code it seems that you are using both as wchar_t, but a codepoint sometimes requires more than one wchar_t.

@GitMensch
Copy link
Collaborator

@Bill-Gray - just to recheck: did you worked on your unfinished patch to fix it on all platforms, possibly with including the function replacement suggested by @okibcn?

For now I've created a bug issue for this and - as asked by @clangen - close this PR in the hope for the patch being finished and checked in.

@GitMensch GitMensch closed this Mar 4, 2023
Bill-Gray added a commit that referenced this pull request Mar 4, 2023
…g characters past 256 should be expanded into multi-byte strings.
@Bill-Gray
Copy link
Owner

I refer to the fix in commit bd50b2b as 'partial' because in non-wide mode, getch() (on some platforms) doesn't break up wide characters into a stream of bytes. Fixes a lot, though, and should put us on the road to a complete fix.

@okibcn
Copy link
Contributor

okibcn commented Mar 5, 2023

This should fix it completely. Already tested and working

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants