Remove TranslateUnicodeToOem and all related code #14745

lhecker · 2023-01-27T16:51:26Z

The overarching intention of this PR is to improve our Unicode support. Most
of our APIs still don't support anything beyond UCS-2 and DBCS sequences.
This commit doesn't fix the UTF-16 support (by supporting surrogate pairs),
but it does improve support for UTF-8 by allowing longer char sequences.

It does so by removing TranslateUnicodeToOem which seems to have had an almost
viral effect on code quality wherever it was used. It made the assumption that
all narrow glyphs encode to 1 char and most wide glyphs to 2 chars.
It also didn't bother to check whether WideCharToMultiByte failed or returned
a different number of chars. So up until now it was easily possible to read
uninitialized stack memory from conhost. Any code that used this function was
forced to do the same "measurement" of narrow/wide glyphs, because of course
it didn't have any way to indicate to the caller how much memory it needs to
store the result. Instead all callers were forced to sorta replicate how it
worked to calculate the required storage ahead of time.
Unsurprisingly, none of the callers used the same algorithm...
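
As an illustration of the contract that was missing here, below is a minimal sketch of a conversion helper that reports how many bytes it produced and fails explicitly when the buffer is too small. It is not the code from this PR: `convert_checked` is a hypothetical name, and it does a portable UTF-16 to UTF-8 conversion instead of calling WideCharToMultiByte.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not from this PR). Converts UTF-16 to UTF-8 and returns
// the number of bytes written, or std::nullopt if `capacity` is too small.
// On failure the contents of `out` are unspecified and must not be read.
std::optional<std::size_t> convert_checked(std::u16string_view in, char* out, std::size_t capacity)
{
    assert(out != nullptr || capacity == 0);

    std::size_t written = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            // Valid surrogate pair: combine both code units into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            // Lone surrogate: substitute U+FFFD.
            cp = 0xFFFD;
        }

        // The encoded length depends on the code point, not on glyph width.
        const std::size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (capacity - written < len)
        {
            return std::nullopt;
        }

        if (len == 1)
        {
            out[written++] = static_cast<char>(cp);
            continue;
        }

        // Lead byte: `len` high bits set, followed by the top bits of the code point.
        const unsigned lead = len == 2 ? 0xC0u : len == 3 ? 0xE0u : 0xF0u;
        out[written++] = static_cast<char>(lead | (cp >> (6 * (len - 1))));
        // Continuation bytes: 10xxxxxx, six bits each.
        for (std::size_t k = len - 1; k-- > 0;)
        {
            out[written++] = static_cast<char>(0x80 | ((cp >> (6 * k)) & 0x3F));
        }
    }
    return written;
}
```

With this shape the caller learns that 'α' takes 2 bytes and an emoji takes 4, instead of having to guess 1 or 2 chars per glyph ahead of time.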

Without it the code is much leaner and easier to understand now. The best
example is COOKED_READ_DATA::_handlePostCharInputLoop which used to contain 3
blocks of almost identical code, but with ever so subtle differences. After
reading the old code for hours I still don't know if they were relevant or not.
It used to be 200 lines of code lacking any documentation and it's now 50 lines
with descriptive function names. I hope this doesn't break anything, but to
be honest I can't imagine anyone having relied on this mess in the first place.

I needed some helpers to handle byte slices (std::span<char>), which is why
a new til/bytes.h header was added. Initially I wrote a buf_writer class
but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as mutable
references and "advance" them (offset the start) whenever they're read from
or written to. I'm not particularly happy with the design but they do the job.
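
To make the "advance the slice" idea concrete, here is a rough sketch of one such freestanding function. The names are hypothetical and not taken from til/bytes.h, and `byte_slice` is a tiny stand-in for `std::span<char>` so that the sketch also builds as C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Minimal stand-in for std::span<char> (assumption, see above).
struct byte_slice
{
    char* data = nullptr;
    std::size_t size = 0;
};

// Hypothetical helper: copies as much of `src` as fits into `dst`, then
// advances both slices past the copied bytes. Returns the number of bytes copied.
std::size_t copy_and_advance(byte_slice& dst, std::string_view& src)
{
    assert(dst.data != nullptr || dst.size == 0);

    const std::size_t n = std::min(dst.size, src.size());
    if (n != 0)
    {
        std::memcpy(dst.data, src.data(), n);
    }

    // "Advancing" means offsetting the start and shrinking the length.
    dst.data += n;
    dst.size -= n;
    src.remove_prefix(n);
    return n;
}
```

After a call, `dst` describes the space that is still free and `src` describes the bytes that still need to be written, so repeated calls can simply be chained without tracking separate offsets.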

Related to #8000
Fixes #4551
Fixes #7589
Fixes #8663

Validation Steps Performed

Unit and feature tests ✅
Far Manager ✅
Fixes the test cases in #4551 (COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8), #7589 (Some multibyte chars, e.g. 'α', cannot be read by ReadFile() under codepage 932 (Japanese)) and #8663 (OpenConsole crash when pasting emoji) ✅

DHowett

9/16 i've skipped the hard stuff for now

DHowett · 2023-01-27T18:00:18Z

src/inc/til/type_traits.h

+        };
+        // std::span remains largely unused in this code base so far.
+        //template<typename U, std::size_t E>
+        //struct is_contiguous_view<std::span<U, E>> : std::true_type


IMO no reason to have it commented out, it'll just be one more place to change when we do use std::span. uncomment?

We don't include <span> anywhere yet, so the type usually isn't defined. I'm planning to simply replace all gsl::span usage with std::span in the future. As a matter of fact, I might just be doing it right now.

#14763 just merged so you're good to go now haha

src/host/misc.h

src/inc/til/at.h

DHowett · 2023-01-27T19:10:25Z

src/server/ApiDispatchers.cpp


-    // TODO: This is also rather strange and will also probably make more sense if we stop guessing that we need 2x buffer to convert.
    // This might need to go on the other side of the fence (inside host) because the server doesn't know what we're going to do with initial num bytes.


Do we still need to figure out whether this "might"?

To be entirely honest I don't quite get what this comment means and what InitialNumBytes is supposed to do. I can try to figure it out, but it can probably also be left as is for the time being.

src/host/stream.h

DHowett · 2023-01-27T19:13:17Z

Also related to #4551?

lhecker · 2023-01-31T14:52:30Z

Also fixes #7589.

When working on #14745 I found `KeyEvent`s a little hard to read in the debugger. I noticed that this is because of sign extension when converting `char`s to `wchar_t`s in `KeyEvent::SetCharData`.
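
For readers unfamiliar with the effect, the following small sketch shows the kind of sign extension meant here. It is an illustration only, not the code in `KeyEvent::SetCharData`: the function names are made up, and `char16_t` stands in for the 16-bit `wchar_t` on Windows so that the result is the same on every platform.

```cpp
// Converting a negative char value directly to a wider type sign-extends it,
// so the byte 0xE9 (-23 as a signed char) turns into 0xFFE9.
constexpr char16_t widen_sign_extended(signed char ch)
{
    return static_cast<char16_t>(ch);
}

// Going through unsigned char first preserves the byte value: 0xE9 -> 0x00E9.
constexpr char16_t widen_zero_extended(signed char ch)
{
    return static_cast<char16_t>(static_cast<unsigned char>(ch));
}

static_assert(widen_sign_extended(-23) == 0xFFE9, "sign extension fills the upper byte with ones");
static_assert(widen_zero_extended(-23) == 0x00E9, "zero extension keeps the original byte value");
```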

When working on #14745 I noticed that `SplitToOem` was in a bit of a poor state as well. Instead of simply iterating over its `deque` argument and writing the results into a new `deque` it used `pop` to advance the head of both queues. This isn't quite exception safe and rather bloaty. Additionally there's no need to call `WideCharToMultiByte` twice on each character if we know that the most verbose encoding is UTF-8 which can't be any more than 4 chars anyways.

Related to #8000.

## PR Checklist
* 2 unit tests cover this ✅
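
The iteration pattern described above could look roughly like the sketch below. This is not the actual `SplitToOem`: the function names are hypothetical, a hand-written UTF-8 encoder stands in for `WideCharToMultiByte`, and surrogates are simply replaced with U+FFFD to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the code page conversion: encodes one UTF-16 code
// unit as UTF-8 into `buf` and returns the length (1 to 3 bytes).
// Simplification: any surrogate code unit is replaced with U+FFFD.
std::size_t encode_unit(char16_t ch, std::array<char, 4>& buf)
{
    const char32_t cp = (ch >= 0xD800 && ch <= 0xDFFF) ? 0xFFFD : ch;
    if (cp < 0x80)
    {
        buf[0] = static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800)
    {
        buf[0] = static_cast<char>(0xC0 | (cp >> 6));
        buf[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    buf[0] = static_cast<char>(0xE0 | (cp >> 12));
    buf[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = static_cast<char>(0x80 | (cp & 0x3F));
    return 3;
}

// Iterates over the input without modifying it and builds a new deque.
// A fixed 4-byte scratch buffer is enough, so no size query is needed first.
std::deque<char> split_to_bytes(const std::deque<char16_t>& in)
{
    std::deque<char> out;
    std::array<char, 4> buf{};
    for (const char16_t ch : in)
    {
        const std::size_t len = encode_unit(ch, buf);
        assert(len <= buf.size());
        out.insert(out.end(), buf.begin(), buf.begin() + static_cast<std::ptrdiff_t>(len));
    }
    return out;
}
```

Because the input is taken by const reference and the result is built in a local, an exception thrown while appending leaves the caller's queue untouched.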


lhecker · 2023-02-03T14:35:30Z

Now after rewriting half of InputBuffer::Read I think that this is pretty solid. I've chosen not to implement support for surrogate pairs, because it was quite frankly a major pain in the 🍑 to stitch KeyEvents back together. Instead, it'd be much better to simply add support for surrogate pairs to KeyEvent itself (i.e. have it store up to 2 wchar_t), so that we don't have to do any stitching in the first place. Or if that's not possible, I'll add support for the stitching some time in the future. For this PR though it was just too much IMO.

Also:

DHowett · 2023-02-28T19:34:36Z

src/inc/til/type_traits.h

-        struct is_contiguous_view<std::basic_string_view<U, V>> : std::true_type
-        {
-        };
-#ifdef GSL_SPAN_H
        template<typename U, std::size_t E>
        struct is_contiguous_view<std::span<U, E>> : std::true_type


Ah woops. 😅 Well, problem fixed!

#14745 contains two regressions related to console alias handling:
* When `ProcessAliases` expands the backup buffer into (an) aliased command(s) it changes the `_bytesRead` field of `COOKED_READ_DATA`, requiring us to re-read it and reconstruct the `input` string-view.
* Multiline aliases are read line-by-line whereas #14745 didn't treat them any different from regular single-line inputs.

## Validation Steps Performed
In `cmd.exe` run
```
doskey test=echo foo$Techo bar$Techo baz
test
```
The output should look exactly like this:
```
C:\>doskey test=echo foo$Techo bar$Techo baz
C:\>test
foo
C:\>bar
C:\>baz
C:\>
```

#14745 removed the only user of `GetAugmentedOutputBuffer`.

alexrp · 2023-03-30T23:13:37Z

Just wanted to check: With this PR merged, is the ReadFile w/ UTF-8 code page situation supposed to be improved?

lhecker · 2023-03-31T00:34:23Z

Yes, the test case as described in #4551 is now fixed. However, I believe that "improved" is just the right choice of words: There's still no consistent internal support for surrogate pairs for instance, but this is something that I'm actively working on.

methane · 2023-05-21T05:18:16Z

Has this fix shipped in Windows 11 conhost already?

lhecker · 2023-05-22T11:06:43Z

Allegedly this will not ship for quite a while unfortunately. 10000s of lines of other changes are also waiting to be shipped, including massive correctness and performance improvements, and I'm quite excited to see when it does.

zadjii-msft · 2023-05-22T11:09:18Z

(It will ship in the Terminal in 1.18 Soon^TM though)

methane · 2023-05-22T11:31:51Z

Thank you!

SainoNamkho · 2023-07-27T10:40:29Z

Is this related to #12626?

Excise TranslateUnicodeToOem and everything it touched

7b5b96d

lhecker added a commit that referenced this pull request Jan 27, 2023

Minor cleanups after #14745

1816410

This was referenced Jan 27, 2023

Minor improvements for SplitToOem #14746

Merged

Make KeyEvent char data a little less confusing #14747

Merged

Minor cleanups after #14745 #14748

Merged

Fix build

819febe

DHowett reviewed Jan 27, 2023


lhecker mentioned this pull request Feb 1, 2023

Don't crash when reading emoji input in utf8 #12342

Closed

3 tasks

lhecker added 2 commits February 3, 2023 15:20

Merge remote-tracking branch 'origin/main' into dev/lhecker/8000-Tran…

ac1f371

…slateUnicodeToOem-wip

Rewrite half InputBuffer

caba56d

lhecker added 2 commits February 3, 2023 15:37

Remove previously added code

eac0aeb

Address feedback

005bc21

lhecker changed the title ~~Excise TranslateUnicodeToOem and everything it touched~~ Remove TranslateUnicodeToOem and all related code Feb 3, 2023

lhecker added 2 commits February 5, 2023 14:17

Simplify InputBuffer::Read and fix KeyEvent repeats

738ce5f

Fix zero initialization

a0f1637

Address feedback

7d9a213

microsoft-github-policy-service bot added the Priority-1 and Severity-Crash labels Feb 28, 2023

Fix bad merge

ee2850d

DHowett reviewed Feb 28, 2023


DHowett approved these changes Feb 28, 2023


DHowett merged commit 599b550 into main Feb 28, 2023

DHowett deleted the dev/lhecker/8000-TranslateUnicodeToOem branch February 28, 2023 20:55

lhecker mentioned this pull request Mar 14, 2023

Fix console aliases not working #14991

Merged

zadjii-msft pushed a commit that referenced this pull request Mar 30, 2023

Minor cleanups after #14745 (#14748)

da3a33f

#14745 removed the only user of `GetAugmentedOutputBuffer`.

DHowett mentioned this pull request Apr 5, 2023

more command doesn't advance to the next screen of text if I press spacebar #15116

Closed

zadjii-msft mentioned this pull request Apr 26, 2023

[1.18] Defterm was only busted on my machine #15238

Closed

This was referenced May 20, 2023

从控制台输入的中文问题 #15380

Closed

Console UTF-8 input is misbehaving on Windows dotnet/runtime#43295

Closed

Terminal should force pseudoconsole host into UTF-8 codepage by default #1802

Open

lhecker mentioned this pull request Jul 27, 2023

ReadConsoleA use with PeekConsoleInput returns a string with leading garbage #12626

Closed

lhecker mentioned this pull request Sep 25, 2023

ReadConsole does not work with utf-8 codepage #16020

Closed

confusedsushi mentioned this pull request Oct 24, 2023

Lost new line character when using ReadFile #16223

Closed

lhecker mentioned this pull request Nov 15, 2023

Fix input buffering for A APIs #16313

Merged

lhecker mentioned this pull request Jan 29, 2024

Reading input byte by byte causes lines with even numbers of characters #16606

Closed

lhecker mentioned this pull request Mar 24, 2024

Implement grapheme clusters #16916

Merged

lhecker mentioned this pull request Sep 23, 2024

Emojis still crashing conhost.exe #17948

Closed
