Replies: 1 comment
I'm doing most of my programming on Linux, so I don't currently have a good perspective on the issues that befall Windows. If this changes the Windows version to behave more like it does on Unix, then I'm for it. I'm currently working on a small game project for fun, and it spams the screen full of supplementary-plane characters, so it sounds like this would be vital to get that working on Windows. Does this affect backends other than Wincon & wingui?
HISTORY
Historically, the Curses library was designed to simplify the ANSI terminal interface, adding support for basic text windows. It had two modes for interfacing with the underlying hardware:
CHAR: a single-byte (8-bit) interface supporting only codes 0-255, with all the complicated codepage machinery we know from the past.
WCHAR: a 4-byte (32-bit) interface supporting all Unicode code points, requiring no codepages and covering international scripts, emoji, etc. The 32 bits store the Unicode code point, while the underlying interface still uses the byte as its communication unit. This made it very easy to adopt UTF-8 as the default string encoding instead of UTF-32 or the more complex UTF-16.
Windows took a different approach: instead of defining WCHAR as UTF-32, Microsoft decided that a 16-bit solution was enough. Soon afterward it became clear that a 64K codespace was not sufficient. UTF-16 therefore evolved to encode the 21 bits of the current Unicode codespace using two-unit sequences (surrogate pairs), defeating the whole point of UTF-16, which was created to guarantee constant-length characters.
THE PROBLEM
While Curses supported UTF-8 at the low-level interface (strings, files, etc.) and kept 32 bits for the wchar_t type, the Windows version mixed these two concepts and tied the Curses-internal wchar_t type to the underlying UTF-16 WCHAR interface. This made it 16 bits long and thus incapable of storing the whole Unicode character set, affecting the supplementary planes above 0xFFFF.
PDCurses and PDCursesMod inherited this defect, as they were born as implementations of the original Curses interface.
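As a minimal illustration of the size mismatch (not from the original post), the following program shows that a supplementary-plane character does not fit into a single 16-bit wchar_t on Windows:

```c
/* wchar_t is 16 bits on Windows and 32 bits on most Unix systems,
 * so a supplementary-plane character such as U+1F600 needs a
 * surrogate pair on Windows. */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t)); /* 2 on Windows, 4 on Linux */

    const wchar_t *smile = L"\U0001F600";  /* U+1F600 GRINNING FACE */
    /* On Windows this literal occupies two 16-bit units (a surrogate
     * pair); on Linux it is a single 32-bit code point. */
    printf("wcslen(smile) = %zu\n", wcslen(smile));
    return 0;
}
```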
THE SOLUTION?
PDCurses and PDCursesMod have a nice decoupling between the main library and the lower-level interface with the OS. This makes it relatively easy to decouple the Curses-level 32-bit wchar_t type from the low-level 16-bit WCHAR when dealing with the old Windows API.
Note that the modern Windows API supports Unicode even when using the byte interface, something you can easily test by compiling a simple C program:
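A minimal sketch along these lines (not the author's original snippet), assuming a Windows 10 or later console with a font able to display the glyphs: it switches the console to the UTF-8 code page and prints multibyte UTF-8, including a supplementary-plane emoji, through the plain byte interface.

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    SetConsoleOutputCP(CP_UTF8);   /* interpret the output byte stream as UTF-8 */

    /* "\xE4\xB8\x96\xE7\x95\x8C" is the UTF-8 encoding of U+4E16 U+754C,
     * and "\xF0\x9F\x98\x80" is U+1F600 (GRINNING FACE). */
    printf("Hello, \xE4\xB8\x96\xE7\x95\x8C! \xF0\x9F\x98\x80\n");
    return 0;
}
```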
By doing this, it becomes straightforward to implement UTF-8 at all levels, supporting full Unicode even when the OS interface communicates byte by byte.
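For example, converting a 32-bit code point to UTF-8 is a small, self-contained routine. A hypothetical helper (the name pdc_to_utf8 is illustrative, not an existing PDCursesMod function) could look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one code point as UTF-8; returns the number of bytes written (1-4). */
static size_t pdc_to_utf8(uint32_t cp, char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: ASCII */
        out[0] = (char)cp;
        return 1;
    } else if (cp < 0x800) {               /* 2 bytes */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {             /* 3 bytes */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                               /* 4 bytes: supplementary planes */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
}
```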
Would it work? Yes. It would not only work, but it IS already working in Nano for Windows.
I had to implement this modification to provide support for the supplementary planes, including additional symbols for Traditional Chinese and other scripts above 0xFFFF, as well as emoji. See it working here:
It would need some additional work, as I have only changed the definition of the wchar_t type to int. A deeper review would be required, decoupling all the places where wchar_t is also used at the OS level, such as clipboard management. There are several other places, but for basic input/output it works, as you can see in the mentioned implementation of Nano for Windows.
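To sketch the direction (the names pdc_wchar and pdc_to_utf16 are hypothetical, not existing PDCursesMod identifiers): the Curses layer would keep a 32-bit character type, and conversion to the 16-bit WCHAR of the Windows API would happen only at the OS boundary, emitting a surrogate pair where needed.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t pdc_wchar;   /* Curses-level character: a full Unicode code point */

/* Encode one code point as UTF-16 for the Windows API;
 * returns the number of 16-bit units written (1 or 2). */
static size_t pdc_to_utf16(pdc_wchar cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                     /* BMP: single unit */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    /* low surrogate  */
    return 2;
}
```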
@Bill-Gray, @GitMensch, @juliusikkala, @rmorales87atx, @shugo, and everyone else interested in improving this library: I would like to open a discussion so that everyone can share their insights on this proposal.
Looking forward to hearing your comments.