Some multibyte chars, e.g. 'α', cannot be read by ReadFile() under codepage 932 (Japanese). #7589
I'm pretty sure that's just how codepages work, though @miniksa can correct me if I'm wrong. I think if you want to use
Did this ever work in previous versions of Windows?
So it looks like this actually does work outside of WT (!)
Well, that's unfortunate. I wonder if it's a bad MB2WC, or maybe bad input events getting written via win32 input mode. Hmmmmm.
I think this is a ConPTY issue, because it can also be reproduced in WSL outside Windows Terminal.
What do you mean by previous versions? Such as Windows 7 or Windows 8.1? Or Windows 10 1903 or earlier? This issue can be reproduced with ConPTY in Windows 10 1809.
In which version(s) of Windows and Windows Terminal, or with which forms of input, should one expect UTF-8 to work correctly as the input codepage? It's not working with typed and pasted input with conhost.exe in Windows 10 2004, or with openconsole.exe in Windows Terminal Preview 1.4.2652.0. Is there a separate code path for East-Asian locales that use IME input, and does it work correctly in that case? For example, in Python's REPL shell, under Windows Terminal, the following uses ctypes (libffi) to directly call
Note that in both cases "α" gets replaced with a null byte. Setting the output codepage to UTF-8 started working correctly with
Setting the input codepage to UTF-8 has never worked correctly with
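As an illustrative aside (assuming Python's codec tables match the Windows ANSI/OEM conversion tables): 'α' is a valid character in both codepages discussed here, but its byte encoding differs, which is why a conversion routed through the wrong codepage can corrupt it.

```python
# 'α' (U+03B1) exists in both CP437 and CP932, but encodes differently.
alpha = "\u03b1"  # 'α'

print(alpha.encode("cp932"))  # b'\x83\xbf' -- the two-byte Shift-JIS sequence
print(alpha.encode("cp437"))  # b'\xe0'    -- a single byte in the OEM US codepage
```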
Part 1: This is strictly between 437 and 932 per the original filer. I have this spreadsheet from the investigation of how it works in classic conhost, since that's what services these API calls. It looks like there is a regression when the input codepage is 932 and the output codepage is 437... that's a scenario that returns null in the new conhost but not in the legacy one.

Part 2: For Windows Terminal, which uses ConPTY, which tries to use UTF-8/65001 for everything... @DHowett noticed this:

terminal/src/terminal/adapter/InteractDispatch.cpp Lines 60 to 68 in 55151a4
Here the interactive dispatcher (input for virtual terminal) is using the OUTPUT codepage to determine what to encode with. That's clearly not right either.

Summary so far:

I will continue.
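A minimal sketch of why encoding input with the output codepage is wrong (Python codecs used as a stand-in for the Windows conversion tables): bytes produced under one codepage don't round-trip through a reader that decodes with a different one.

```python
# Hypothetical illustration: input encoded with the *output* codepage (437)
# while the reading side decodes with the *input* codepage (932).
alpha = "\u03b1"

wrong_bytes = alpha.encode("cp437")   # b'\xe0'
try:
    wrong_bytes.decode("cp932")       # 0xE0 is a Shift-JIS lead byte with no trail byte
except UnicodeDecodeError as e:
    print("decode failed:", e)
```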
Part 1: The function where this converts the α into 0x00 instead of the two-byte sequence 0x83 0xbf is at Line 271 in 9b92986
The α is given to
I'm going to explore item 2 further.

Part 1, Item 2: The stack where this occurs is
One frame up. Code reference: Lines 157 to 208 in 09471c3
The length of space given to
I momentarily drilled into
The questions then are:
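Since the questions here concern whether the buffer handed to the converter is large enough: a safe pattern is to ask the encoder for the real byte count instead of guessing from glyph width. This is a rough Python approximation of the WideCharToMultiByte "measure, then convert" two-call idiom, not the conhost code itself.

```python
# Sketch of the "measure, then convert" pattern, approximated with Python
# codecs: ask the encoder for the actual byte count rather than assuming
# one byte per narrow glyph.
def required_bytes(text: str, codepage: str) -> int:
    return len(text.encode(codepage))

print(required_bytes("\u03b1", "cp932"))  # 2 -- even though 'α' may render narrow
print(required_bytes("abc", "cp932"))     # 3
```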
Part 1, Item 2, Questions
No. That's not the issue. It's the font that's the issue. There are two circumstances I've now observed in
If the font says that the
As far as I can tell, this is a long-standing mix-up that someone made long ago when implementing this. The terminology of "full width" and "double byte" and the number of
It's different because it was using RtlUnicodeToOemN when the system startup

Conclusion: Due to the total mix-up of DBCS, wide/narrow, and 1/2 bytes in
I believe that the fix here is to make the
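The font-dependent behavior described above lines up with Unicode's "ambiguous" East Asian Width class: 'α' is classified ambiguous, so East Asian fonts typically render it wide while others render it narrow, and glyph width cannot predict its byte length. A small demonstration using Python's `unicodedata`:

```python
import unicodedata

# 'α' is in East Asian Width class 'A' (Ambiguous): rendered wide by many
# East Asian fonts and narrow by others, yet it is always 2 bytes in CP932.
print(unicodedata.east_asian_width("\u03b1"))   # 'A'
print(len("\u03b1".encode("cp932")))            # 2
```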
Okay, so I have a proposed fix for the translation part in
It's here: c1ca8f3

But now there's another problem: the parent calling it isn't giving enough buffer space. And that isn't because the user buffer provided in the API call was necessarily too small; it's because the parent function

terminal/src/host/readDataCooked.cpp Lines 1003 to 1199 in 9b92986
That sort of behavior is standard practice for COOKED READ... that is, guessing byte count based on character width, which can depend on the font chosen, which can depend on the code page selected. Gross.

So my plan is to try to make a targeted fix in

Also...
I wrote a few tests here to cover this as I work on fixing it:
Feel free to do whatever is necessary to make good use of the code. Get back to me if you'd like me to work on it any further. Note: there are still codepages where the handling of partials is not supported. They are listed in the method description of
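For context on what "handling of partials" means here: when a DBCS sequence is split across two reads, the decoder must stash the lead byte and complete the character on the next call rather than emitting a replacement or null byte. Python's incremental decoder (used here purely as an illustration, not the conhost implementation) shows the desired behavior:

```python
import codecs

# Splitting the CP932 sequence 0x83 0xbf ('α') across two decode calls:
# the lead byte is held back until the trail byte arrives.
dec = codecs.getincrementaldecoder("cp932")()

print(repr(dec.decode(b"\x83")))  # '' -- lead byte stashed, nothing emitted yet
print(repr(dec.decode(b"\xbf")))  # 'α' -- trail byte completes the character
```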
That's fine. Thank you! |
The overarching intention of this PR is to improve our Unicode support. Most of our APIs still don't support anything beyond UCS-2 and DBCS sequences. This commit doesn't fix the UTF-16 support (by supporting surrogate pairs), but it does improve support for UTF-8 by allowing longer `char` sequences. It does so by removing `TranslateUnicodeToOem`, which seems to have had an almost viral effect on code quality wherever it was used. It made the assumption that _all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2 `char`s. It also didn't bother to check whether `WideCharToMultiByte` failed or returned a different number of `char`s. So up until now it was easily possible to read uninitialized stack memory from conhost. Any code that used this function was forced to do the same "measurement" of narrow/wide glyphs, because _of course_ it didn't have any way to indicate to the caller how much memory it needs to store the result. Instead all callers were forced to sorta replicate how it worked to calculate the required storage ahead of time. Unsurprisingly, none of the callers used the same algorithm... Without it the code is much leaner and easier to understand now. The best example is `COOKED_READ_DATA::_handlePostCharInputLoop`, which used to contain 3 blocks of _almost_ identical code, but with ever so subtle differences. After reading the old code for hours I still don't know if they were relevant or not. It used to be 200 lines of code lacking any documentation and it's now 50 lines with descriptive function names. I hope this doesn't break anything, but to be honest I can't imagine anyone having relied on this mess in the first place. I needed some helpers to handle byte slices (`std::span<char>`), which is why a new `til/bytes.h` header was added. Initially I wrote a `buf_writer` class but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as mutable references and "advance" them (offset the start) whenever they're read from or written to. I'm not particularly happy with the design but they do the job. Related to #8000 Fixes #4551 Fixes #7589 Fixes #8663 ## Validation Steps Performed * Unit and feature tests ✅ * Far Manager ✅ * Fixes test cases in #4551, #7589 and #8663 ✅
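A rough Python analogy (with a hypothetical helper name) to the `til/bytes.h` idea: freestanding functions that write into a destination slice and "advance" it past the bytes consumed. Python can't pass a span by mutable reference the way C++ can, so this sketch returns the advanced remainder instead.

```python
# Hypothetical analogue of a til/bytes.h-style write-and-advance helper.
def write_and_advance(dest: memoryview, data: bytes) -> memoryview:
    n = min(len(dest), len(data))
    dest[:n] = data[:n]      # copy as much as fits
    return dest[n:]          # the not-yet-written remainder of the slice

buf = bytearray(8)
rest = memoryview(buf)
rest = write_and_advance(rest, b"\x83\xbf")
rest = write_and_advance(rest, b"abc")
print(bytes(buf))  # b'\x83\xbfabc\x00\x00\x00'
```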
Environment
Steps to reproduce
Expected behavior
If the test-prog above is executed in command prompt, the multibyte char 'α' can be read correctly as follows.
Actual behavior
If the test-prog above is executed in Windows Terminal, the multibyte char 'α' cannot be read correctly as follows.
This also happens with getchar(), getwchar(), gets(), ReadConsoleA(), etc.
As far as I tested, only ReadConsoleW() can read 'α' correctly.
Most Japanese multibyte chars can be read correctly with the code above as follows,
even in Windows Terminal.
Supplement
Similar happens under CP936 (Simplified Chinese) and CP949 (Korean).
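This is consistent with 'α' also being a two-byte, ambiguous-width character in those codepages (Python codec names assumed to correspond to CP936/CP949), so the same narrow-glyph/one-byte guess fails there too:

```python
# 'α' is a two-byte character in the Simplified Chinese and Korean
# codepages as well ("gbk" used as the Python name for CP936).
alpha = "\u03b1"

for name in ("gbk", "cp949"):
    encoded = alpha.encode(name)
    print(name, encoded, len(encoded))
```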