[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777
I expect (hope) that @eryksun has some things to contribute here, given how intimately he's reverse-engineered our stack.
Research from today... I'm writing with … This is the scenario overview:
Current thoughts:

**Giving leftover bytes of previous codepage after a switch on read continuation**
I'm pretty sold that the v1 behavior is right here. If you changed the input codepage halfway through reading a byte stream, I think it's easy to say you don't care about the left-behind trailing bytes for DBCS or, theoretically, the rest of the UTF-8 sequence.

**Losing a character when the codepage changes**
This feels absolutely wrong on v1's part. We shouldn't be losing characters for switching a codepage. I'd like to figure out why, but I don't intend to preserve that.

**Returning the read bytes as how many were pulled off the input queue, not how much of the given buffer was used**
This feels really strange and feels like an ages-old mistake in …. So I could see that one going either way. And for UTF-8 it could easily be 1-3 greater than the buffer size, not just the 1 greater that happens with a torn DBCS lead byte.

EDIT: I'm now more strongly favoring that the number read should never exceed the buffer size. Because then what do we say on the return of the extra held bytes? Is the total count now absurdly larger than the total concatenated byte length? Is it zero or smaller than the length of the buffer we filled? Yuck!
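The "discard leftovers on codepage switch" behavior argued for above can be sketched in a few lines. This is a hypothetical illustration, not conhost code; the struct and field names are invented:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the v1 behavior described above: any partial
// trailing bytes held from a previous read become meaningless when the
// input codepage changes, so they are simply dropped.
struct PartialByteState
{
    unsigned codepage = 437;
    std::vector<uint8_t> pending; // trail bytes of a torn character

    void setCodepage(unsigned newCp)
    {
        if (newCp != codepage)
        {
            pending.clear(); // leftovers from the old codepage are discarded
            codepage = newCp;
        }
    }
};
```

Whether the bytes should instead be reinterpreted in the new codepage is exactly the design question; the sketch takes the v1 position that they should not.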
**Losing the trailing byte when reading 1 by 1**
The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read. Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe it's actually on a different handle and therefore I don't get the trail?
Doesn't look like it's a threading thing. I had to allow shared on the test screen buffer because cooked input is on and it needs a handle to the output buffer so it can echo. So an unshared test screen buffer won't work.
I have a theory on this. I think if we look in … (lines 183 to 202 in 015675d):
I think the only time that it will actually manage to store the trailing byte in the …. However, if we look in the method that is decoding (`terminal/src/server/ApiDispatchers.cpp`, lines 271 to 276 in ae550e0):
So I'm not sure it's even possible to have the trailing byte get stored appropriately and be delivered later. But I've only looked at this for cooked read. So...
Still need to check.
Nope. Doesn't work. I tried zα and reading either 1 or 2 bytes. If you read 1, you get just the z. If you read 2, it sends back z then 0x83 0xbf and reports 3 bytes read, but only 2 ever fill the client app's buffer and it's just gone. I didn't find what in the driver/client servicing is preventing the overrun... but something is. And it's losing the extra bytes.
I believe the answer is yes. Also, reading the client code, we should be returning the number of chars read as the amount of the client application's buffer we filled, not more than that. Something's saving the console system from the overrun, but we're being mean and inducing an overrun in the client.
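The two candidate return values under discussion can be made concrete with a tiny sketch. Everything here is hypothetical (invented names, not the real API surface): a read that pops a whole encoded character off the queue but can only fit part of it in the client's buffer:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: the open question is whether the reported count
// should be bytesPopped (v1-ish, can exceed bufferSize and induce a client
// overrun) or bytesCopied (never exceeds the buffer).
struct ReadResult
{
    size_t bytesCopied;         // how much of the client's buffer we filled
    size_t bytesPopped;         // how much we consumed from the input queue
    std::vector<uint8_t> stash; // leftover trailing bytes held for next read
};

ReadResult readBytes(const std::vector<uint8_t>& queue, uint8_t* buffer, size_t bufferSize)
{
    ReadResult r{};
    r.bytesCopied = std::min(queue.size(), bufferSize);
    std::memcpy(buffer, queue.data(), r.bytesCopied);
    r.bytesPopped = queue.size(); // the whole character came off the queue
    r.stash.assign(queue.begin() + r.bytesCopied, queue.end());
    return r;
}
```

With a 3-byte UTF-8 character and a 2-byte buffer, `bytesPopped` is 3 while `bytesCopied` is 2; a client that trusts the larger number indexes past its own buffer, which is the overrun described above.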
For cooked, raw... (…)

For direct... (…)
Oh no. Oh no no.
So here it is. For conhostv1 (and thus as far back as we can tell for compatibility reasons):
The pitfalls include:
So my plan is to codify this in a series of tests against v1. Then add a switch in the tests for what the behavior SHOULD be for any app to actually be able to use these APIs (that is... take all those bugs/mistakes out). And as a part of running the bugs out, I should also be lighting up UTF-8 on those APIs. This is technically a compatibility break, but given that the APIs were pretty much useless, providing garbage data to any app unfortunate enough to attempt to use them for as far back as we can determine, we're going to call that acceptable risk for now.
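The "test switch" plan above can be sketched as a single helper that encodes both expectations, selected by a flag. This is a hypothetical illustration (invented names, not the actual test code):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the test plan: the same scenario asserted against
// two expectations, selected by a switch — v1's compatibility behavior
// (reported count can exceed the buffer) vs. the corrected behavior
// (reported count never exceeds the buffer).
enum class Behavior
{
    V1Compat,
    Corrected
};

// Reading one multi-byte character of `charSize` bytes into a buffer of
// `bufferSize` bytes: what count should the API report?
size_t expectedBytesRead(Behavior b, size_t bufferSize, size_t charSize)
{
    return b == Behavior::V1Compat ? charSize : std::min(bufferSize, charSize);
}
```

Each scenario test would then assert against `expectedBytesRead(behaviorUnderTest, ...)`, so the same suite documents both the legacy bugs and the intended fix.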
To future self: It's not safe to go alone, take this:
FYI, large parts of this issue will be resolved by #15783. Some parts of conhost are still unaware of UTF-16 and UTF-8, however. But we're getting close.
Do we have tracking issues for what parts of this aren't fixed by #15783? It kinda looks like everything linked here will be closed when that merges. |
#15783 has been merged. Is this issue still relevant? |
Hmm, yeah, I think it'd be fair to close this issue. There are still some edge cases, but they aren't even mentioned in this issue. Thanks!
[Claimed by miniksa]
This issue exists to resolve many issues around cooked read data processing, including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in `cmd.exe`), and the general buffer grossness of mix-matching both `char` and `wchar_t` text within the same client-owned buffer space during waiting read operations.

**Related issues**

These issues are related and should be improved or resolved by this work:

- `COOKED_READ_DATA` ctor takes a `string_view` for `wchar_t` data? #5618
- `til::au16` and `til::u16a` conversion functions & make first use in `WriteConsoleAImpl` #4493

These issues should be easier to resolve after this work or be fixed by it:
Uncategorized:
This is what I expect to do:

- `_handlePostCharInputLoop`: revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead, move to just translating the text and storing the excess, much like an aliased command would do.
- `TranslateUnicodeToOem` has almost no usages anymore (and `ConvertToOem`) and could likely be converged with `ConvertToA`. Also, the fallback behavior of `TranslateUnicodeToOem` can be accomplished just by asking WC2MB to replace with the default character anyway.
- … `IsDBCSLeadByteConsole` or not.
- … `IsDBCSLeadByteConsole` and just calling the `winnls.h`-exported `IsDBCSLeadByteEx` with the same codepage (and then not holding onto the `CPInfo` stuff at all.)
- … `winnls.h` directly.
- … `CheckBisectStringA` because it only seems to have one consumer that's really just checking if the final cell of a row … `IsDBCSLeadByteEx` …
- `ConvertInputToUnicode` and `ConvertOutputToUnicode` are pretty darn close to `ConvertToA` and `ConvertToW` anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have `ConvertToA` and `ConvertToW` have those modes, run all the translations through those, and use the safer and more sensible string-in-string-out pattern to translate everything?
- … `ConvertToA` and `ConvertToW`... perhaps it's time for `til::u8u16` and `til::u16u8` to get their `til::au16` and `til::u16a` variants brought in, have the flags added, and just have a unified way of converting things around here.
- … `CharToWchar`, since it's just translation of a short string (single character) but with the glyph for ctrl char substitution, into a `til::au16` with the glyph chars flag?

The expected effects are primarily:

- `ReadConsoleA`, `ReadFile`, and such should work correctly.

More will show up in each of these headings as I discover it or we have feedback.