Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TranslateUnicodeToOem and all related code #14745

Merged
merged 12 commits into from
Feb 28, 2023

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Jan 27, 2023

The overarching intention of this PR is to improve our Unicode support. Most
of our APIs still don't support anything beyond UCS-2 and DBCS sequences.
This commit doesn't fix the UTF-16 support (by supporting surrogate pairs),
but it does improve support for UTF-8 by allowing longer char sequences.

It does so by removing TranslateUnicodeToOem which seems to have had an almost
viral effect on code quality wherever it was used. It made the assumption that
all narrow glyphs encode to 1 char and most wide glyphs to 2 chars.
It also didn't bother to check whether WideCharToMultiByte failed or returned
a different amount of chars. So up until now it was easily possible to read
uninitialized stack memory from conhost. Any code that used this function was
forced to do the same "measurement" of narrow/wide glyphs, because of course
it didn't had any way to indicate to the caller how much memory it needs to
store the result. Instead all callers were forced to sorta replicate how it
worked to calculate the required storage ahead of time.
Unsurprisingly, none of the callers used the same algorithm...

Without it the code is much leaner and easier to understand now. The best
example is COOKED_READ_DATA::_handlePostCharInputLoop which used to contain 3
blocks of almost identical code, but with ever so subtle differences. After
reading the old code for hours I still don't know if they were relevant or not.
It used to be 200 lines of code lacking any documentation and it's now 50 lines
with descriptive function names. I hope this doesn't break anything, but to
be honest I can't imagine anyone having relied on this mess in the first place.

I needed some helpers to handle byte slices (std::span<char>), which is why
a new til/bytes.h header was added. Initially I wrote a buf_writer class
but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as mutable
references and "advance" them (offset the start) whenever they're read from
or written to. I'm not particularly happy with the design but they do the job.

Related to #8000
Fixes #4551
Fixes #7589
Fixes #8663

Validation Steps Performed

@lhecker lhecker added Product-Conhost For issues in the Console codebase Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Product-Terminal The new Windows Terminal. Issue-Scenario labels Jan 27, 2023
lhecker added a commit that referenced this pull request Jan 27, 2023
Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9/16 i've skipped the hard stuff for now

};
// std::span remains largely unused in this code base so far.
//template<typename U, std::size_t E>
//struct is_contiguous_view<std::span<U, E>> : std::true_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO no reason to have it commented out, it'll just be one more place to change when we do use std::span. uncomment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't import the <span> anywhere yet, so the type usually isn't defined. I'm planning to simply replace all gsl::span usage with std::span in the future. Matter of fact, I might just be doing it right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#14763 just merged so you're good to go now haha

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree!

src/host/misc.h Outdated Show resolved Hide resolved
src/host/misc.h Outdated Show resolved Hide resolved
src/host/misc.h Outdated Show resolved Hide resolved
src/inc/til/at.h Outdated Show resolved Hide resolved

// TODO: This is also rather strange and will also probably make more sense if we stop guessing that we need 2x buffer to convert.
// This might need to go on the other side of the fence (inside host) because the server doesn't know what we're going to do with initial num bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to figure out whether this "might"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be entirely honest I don't quite get what this comment means and what InitialNumBytes is supposed to do. I can try to figure it out, but it can probably also be left as is for the time being.

src/host/stream.h Outdated Show resolved Hide resolved
@DHowett
Copy link
Member

DHowett commented Jan 27, 2023

Also related to #4551?

@ghost ghost added Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Task It's a feature request, but it doesn't really need a major design. Priority-2 A description (P2) labels Jan 31, 2023
@lhecker
Copy link
Member Author

lhecker commented Jan 31, 2023

Also fixes #7589.

carlos-zamora pushed a commit that referenced this pull request Feb 2, 2023
When working on #14745 I found `KeyEvent`s a little hard to read in the
debugger. I noticed that this is because of sign extension when converting
`char`s to `wchar_t`s in `KeyEvent::SetCharData`.
carlos-zamora pushed a commit that referenced this pull request Feb 2, 2023
When working on #14745 I noticed that `SplitToOem` was in a bit of a poor state
as well. Instead of simply iterating over its `deque` argument and writing the
results into a new `deque` it used `pop` to advance the head of both queues.
This isn't quite exception safe and rather bloaty. Additionally there's no need
to call `WideCharToMultiByte` twice on each character if we know that the most
verbose encoding is UTF-8 which can't be any more than 4 chars anyways.

Related to #8000.

## PR Checklist
* 2 unit tests cover this ✅
@lhecker
Copy link
Member Author

lhecker commented Feb 3, 2023

Now after rewriting half of InputBuffer::Read I think that this is pretty solid. I've chosen not to implement support for surrogate pairs, because it was quite frankly a major pain in the 🍑 to stitch KeyEvents back together. Instead, it'd be much better to simply add support for surrogate pairs to KeyEvent itself (i.e. have it store up to 2 wchar_t), so that we don't have to do any stitching in the first place. Or if that's not possible, I'll add support for the stitching some time in the future. For this PR though it was just too much IMO.

Also:

@lhecker lhecker changed the title Excise TranslateUnicodeToOem and everything it touched Remove TranslateUnicodeToOem and all related code Feb 3, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot added Priority-1 A description (P1) Severity-Crash Crashes are real bad news. labels Feb 28, 2023
struct is_contiguous_view<std::basic_string_view<U, V>> : std::true_type
{
};
#ifdef GSL_SPAN_H
template<typename U, std::size_t E>
struct is_contiguous_view<std::span<U, E>> : std::true_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rofl, oops

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah woops. 😅 Well, problem fixed!

@DHowett DHowett merged commit 599b550 into main Feb 28, 2023
@DHowett DHowett deleted the dev/lhecker/8000-TranslateUnicodeToOem branch February 28, 2023 20:55
carlos-zamora pushed a commit that referenced this pull request Mar 17, 2023
#14745 contains two regressions related to console alias handling:
* When `ProcessAliases` expands the backup buffer into (an) aliased
  command(s) it changes the `_bytesRead` field of `COOKED_READ_DATA`,
  requiring us to re-read it and reconstruct the `input` string-view.
* Multiline aliases are read line-by-line whereas #14745 didn't treat
  them any different from regular single-line inputs.

## Validation Steps Performed
In `cmd.exe` run
```
doskey test=echo foo$Techo bar$Techo baz
test
```
The output should look exactly like this:
```
C:\>doskey test=echo foo$Techo bar$Techo baz

C:\>test
foo

C:\>bar

C:\>baz

C:\>
```
zadjii-msft pushed a commit that referenced this pull request Mar 30, 2023
#14745 removed the only user of `GetAugmentedOutputBuffer`.
@alexrp
Copy link

alexrp commented Mar 30, 2023

Just wanted to check: With this PR merged, is the ReadFile w/ UTF-8 code page situation supposed to be improved?

@lhecker
Copy link
Member Author

lhecker commented Mar 31, 2023

Yes, the test case as described in #4551 is now fixed. However, I believe that "improved" is just the right choice of words: There's still no consistent internal support for surrogate pairs for instance, but this is something that I'm actively working on.

@methane
Copy link

methane commented May 21, 2023

Has this fix shipped in Windows 11 conhost already?

@lhecker
Copy link
Member Author

lhecker commented May 22, 2023

Allegedly this will not ship for quite a while unfortunately. 10000s of lines of other changes are also waiting to be shipped, including massive correctness and performance improvements, and I'm quite excited to see when it does.

@zadjii-msft
Copy link
Member

(It will ship in the Terminal in 1.18 SoonTM though)

@methane
Copy link

methane commented May 22, 2023

Thank you!

@SainoNamkho
Copy link

Is this related to #12626?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Scenario Issue-Task It's a feature request, but it doesn't really need a major design. Priority-1 A description (P1) Priority-2 A description (P2) Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal. Severity-Crash Crashes are real bad news.
Projects
None yet
7 participants