Some multibyte chars, e.g. 'α', cannot be read by ReadFile() under codepage 932 (Japanese). #7589
I'm pretty sure that's just how codepages work, though @miniksa can correct me if I'm wrong. I think if you want to use
Did this ever work in previous versions of Windows?
So it looks like this actually does work outside of WT (!)
Well, that's unfortunate. I wonder if it's a bad MB2WC, or maybe bad input events getting written via win32 input mode. Hmmmmm.
I think this is a ConPTY issue, because it can also be reproduced in WSL outside Windows Terminal.
What do you mean by previous versions? Such as Windows 7 or Windows 8.1? Or Windows 10 1903 or earlier? This issue can be reproduced with ConPTY in Windows 10 1809.
In which version(s) of Windows and Windows Terminal, or with which forms of input, should one expect UTF-8 to work correctly as the input codepage? It's not working with typed and pasted input with conhost.exe in Windows 10 2004, or with openconsole.exe in Windows Terminal Preview 1.4.2652.0. Is there a separate code path for East-Asian locales that use IME input, and does it work correctly in that case? For example, in Python's REPL shell, under Windows Terminal, the following uses ctypes (libffi) to directly call
Note that in both cases "α" gets replaced with a null byte. Setting the output codepage to UTF-8 started working correctly with
Setting the input codepage to UTF-8 has never worked correctly with
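As an illustrative aside (assuming Python's codec tables match the Windows ANSI/OEM conversion tables): 'α' is a valid character in both codepages discussed here, but its byte encoding differs, which is why a conversion routed through the wrong codepage can corrupt it.

```python
# 'α' (U+03B1) exists in both CP437 and CP932, but encodes differently.
alpha = "\u03b1"  # 'α'

print(alpha.encode("cp932"))  # b'\x83\xbf' -- the two-byte Shift-JIS sequence
print(alpha.encode("cp437"))  # b'\xe0'    -- a single byte in the OEM US codepage
```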
Part 1: This is strictly between 437 and 932 per the original filer. I have this spreadsheet from the investigation of how it works in classic conhost, since that's what services these API calls. It looks like there is a regression when the input codepage is 932 and the output codepage is 437... that's a scenario that returns null in the new conhost but not in the legacy one.

Part 2: For Windows Terminal, which uses ConPTY, which tries to use UTF-8/65001 for everything... @DHowett noticed this:

terminal/src/terminal/adapter/InteractDispatch.cpp Lines 60 to 68 in 55151a4
Here the interactive dispatcher (input for virtual terminal) is using the OUTPUT codepage to determine what to encode with. That's clearly not right either.

Summary so far:

I will continue.
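A minimal sketch of why encoding input with the output codepage is wrong (Python codecs used as a stand-in for the Windows conversion tables): bytes produced under one codepage don't round-trip through a reader that decodes with a different one.

```python
# Hypothetical illustration: input encoded with the *output* codepage (437)
# while the reading side decodes with the *input* codepage (932).
alpha = "\u03b1"

wrong_bytes = alpha.encode("cp437")   # b'\xe0'
try:
    wrong_bytes.decode("cp932")       # 0xE0 is a Shift-JIS lead byte with no trail byte
except UnicodeDecodeError as e:
    print("decode failed:", e)
```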
Part 1: The function where this converts the α into 0x00 instead of the two-byte sequence 0x83 0xbf is at Line 271 in 9b92986
The α is given to
I'm going to explore item 2 further.

Part 1, Item 2: The stack where this occurs is
One frame up. Code reference: Lines 157 to 208 in 09471c3
The length of space given to
I momentarily drilled into
The questions then are:
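Since the questions here concern whether the buffer handed to the converter is large enough: a safe pattern is to ask the encoder for the real byte count instead of guessing from glyph width. This is a rough Python approximation of the WideCharToMultiByte "measure, then convert" two-call idiom, not the conhost code itself.

```python
# Sketch of the "measure, then convert" pattern, approximated with Python
# codecs: ask the encoder for the actual byte count rather than assuming
# one byte per narrow glyph.
def required_bytes(text: str, codepage: str) -> int:
    return len(text.encode(codepage))

print(required_bytes("\u03b1", "cp932"))  # 2 -- even though 'α' may render narrow
print(required_bytes("abc", "cp932"))     # 3
```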
Part 1, Item 2, Questions
No. That's not the issue. It's the font that's the issue. There are two circumstances I've now observed in
If the font says that the
As far as I can tell, this is a long-standing mix-up that someone made long ago when implementing this. The terminology of "full width" and "double byte" and the number of
It's different because it was using RtlUnicodeToOemN when the system startup

Conclusion: Due to the total mix-up of DBCS, wide/narrow, and 1/2 bytes in
I believe that the fix here is to make the
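The font-dependent behavior described above lines up with Unicode's "ambiguous" East Asian Width class: 'α' is classified ambiguous, so East Asian fonts typically render it wide while others render it narrow, and glyph width cannot predict its byte length. A small demonstration using Python's `unicodedata`:

```python
import unicodedata

# 'α' is in East Asian Width class 'A' (Ambiguous): rendered wide by many
# East Asian fonts and narrow by others, yet it is always 2 bytes in CP932.
print(unicodedata.east_asian_width("\u03b1"))   # 'A'
print(len("\u03b1".encode("cp932")))            # 2
```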
Okay, so I have a proposed fix for the translation part in
It's here: c1ca8f3

But now there's another problem: the parent calling it isn't giving enough buffer space. And that isn't because the user buffer provided in the API call was necessarily too small; it's because the parent function

terminal/src/host/readDataCooked.cpp Lines 1003 to 1199 in 9b92986
That sort of behavior is standard practice for COOKED READ... that is, guessing byte count based on character width, which can depend on the font chosen, which can depend on the code page selected. Gross.

So my plan is to try to make a targeted fix in

Also...
I wrote a few tests here to cover this as I work on fixing it:
Feel free to do whatever is necessary to make good use of the code. Get back to me if you'd like me to work on it any further. Note: there are still codepages where the handling of partials is not supported. They are listed in the method description of
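For context on what "handling of partials" means here: when a DBCS sequence is split across two reads, the decoder must stash the lead byte and complete the character on the next call rather than emitting a replacement or null byte. Python's incremental decoder (used here purely as an illustration, not the conhost implementation) shows the desired behavior:

```python
import codecs

# Splitting the CP932 sequence 0x83 0xbf ('α') across two decode calls:
# the lead byte is held back until the trail byte arrives.
dec = codecs.getincrementaldecoder("cp932")()

print(repr(dec.decode(b"\x83")))  # '' -- lead byte stashed, nothing emitted yet
print(repr(dec.decode(b"\xbf")))  # 'α' -- trail byte completes the character
```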
That's fine. Thank you! |
The overarching intention of this PR is to improve our Unicode support. Most of our APIs still don't support anything beyond UCS-2 and DBCS sequences. This commit doesn't fix the UTF-16 support (by supporting surrogate pairs), but it does improve support for UTF-8 by allowing longer `char` sequences. It does so by removing `TranslateUnicodeToOem`, which seems to have had an almost viral effect on code quality wherever it was used. It made the assumption that _all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2 `char`s. It also didn't bother to check whether `WideCharToMultiByte` failed or returned a different number of `char`s. So up until now it was easily possible to read uninitialized stack memory from conhost. Any code that used this function was forced to do the same "measurement" of narrow/wide glyphs, because _of course_ it didn't have any way to indicate to the caller how much memory it needs to store the result. Instead all callers were forced to sorta replicate how it worked to calculate the required storage ahead of time. Unsurprisingly, none of the callers used the same algorithm... Without it the code is much leaner and easier to understand now. The best example is `COOKED_READ_DATA::_handlePostCharInputLoop`, which used to contain 3 blocks of _almost_ identical code, but with ever so subtle differences. After reading the old code for hours I still don't know if they were relevant or not. It used to be 200 lines of code lacking any documentation and it's now 50 lines with descriptive function names. I hope this doesn't break anything, but to be honest I can't imagine anyone having relied on this mess in the first place. I needed some helpers to handle byte slices (`std::span<char>`), which is why a new `til/bytes.h` header was added. Initially I wrote a `buf_writer` class but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as mutable references and "advance" them (offset the start) whenever they're read from or written to. I'm not particularly happy with the design but they do the job. Related to #8000 Fixes #4551 Fixes #7589 Fixes #8663 ## Validation Steps Performed * Unit and feature tests ✅ * Far Manager ✅ * Fixes test cases in #4551, #7589 and #8663 ✅
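A rough Python analogy (with a hypothetical helper name) to the `til/bytes.h` idea: freestanding functions that write into a destination slice and "advance" it past the bytes consumed. Python can't pass a span by mutable reference the way C++ can, so this sketch returns the advanced remainder instead.

```python
# Hypothetical analogue of a til/bytes.h-style write-and-advance helper.
def write_and_advance(dest: memoryview, data: bytes) -> memoryview:
    n = min(len(dest), len(data))
    dest[:n] = data[:n]      # copy as much as fits
    return dest[n:]          # the not-yet-written remainder of the slice

buf = bytearray(8)
rest = memoryview(buf)
rest = write_and_advance(rest, b"\x83\xbf")
rest = write_and_advance(rest, b"abc")
print(bytes(buf))  # b'\x83\xbfabc\x00\x00\x00'
```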
Environment
Steps to reproduce
Expected behavior
If the test-prog above is executed in command prompt, the multibyte char 'α' can be read correctly as follows.
Actual behavior
If the test-prog above is executed in Windows Terminal, the multibyte char 'α' cannot be read correctly as follows.
This also happens with getchar(), getwchar(), gets(), ReadConsoleA(), etc.
As far as I tested, only ReadConsoleW() can read 'α' correctly.
Most Japanese multibyte chars can be read correctly with the code above as follows,
even in Windows Terminal.
Supplement
Similar happens under CP936 (Simplified Chinese) and CP949 (Korean).
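This is consistent with 'α' also being a two-byte, ambiguous-width character in those codepages (Python codec names assumed to correspond to CP936/CP949), so the same narrow-glyph/one-byte guess fails there too:

```python
# 'α' is a two-byte character in the Simplified Chinese and Korean
# codepages as well ("gbk" used as the Python name for CP936).
alpha = "\u03b1"

for name in ("gbk", "cp949"):
    encoded = alpha.encode(name)
    print(name, encoded, len(encoded))
```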