-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression vs legacy conhost: different behavior from ReadConsoleOutputCharacterW when surrogate pair(s) are present. #16892
Comments
I'm thinking this is in the neighborhood of #13339 |
I could potentially understand cutting the ability to read such lines. Although Windows has always been able to access the console buffer, *nix has never been able to. At a minimum, though, the API should return failure (false) when it fails. The main problem in the case at hand wasn't even the failure to read the console text; the main issue was the API reported that it succeeded but it had actually failed. Returning failure would have avoided the real world usage problem linked above, since the code was already actively checking for failures. Ideally the API would succeed and behave how it did in legacy conhost -- but if it won't, then returning that there was a failure lets the app handle fallback appropriately. |
The regression is not intentional and unrelated to the intentional ones in #13339, but it is in the neighborhood. I think it's caused by #13626 which added proper support for surrogate pairs but changed the behavior of This internal regression causes the buffer to be filled with (for instance) 121 chars in a 120 column terminal, but you're only giving a 120 char buffer in your repro. The following code then drops the returned text and still returns terminal/src/host/directio.cpp Lines 816 to 821 in ba34abb
I've had a brief glance at the original conhost v1 code and it doesn't seem like this behavior existed there, so I believe this is another regression that was introduced in the original conhost v2 release. I'm submitting a PR to fix this issue soon. It'll then correctly count the number of characters again. |
@lhecker ah ok that makes sense. Thank you for working on a fix quickly! It sounds like you're aware of everything I'm about to add, but for completeness, I'll still call these out overtly: The trouble with 120 vs 121, is that the API signature uses the
So, if the screen width is 120 and a caller passes 121, then on a line with no surrogate pairs the 121st character will be read from the next line. And that requires a caller to post-process the results to figure out how much of the returned content corresponds to the target line. A caller that has a robust Unicode iterator could ask for e.g. 2 * the screen width and then parse the returned content, measuring widths (e.g. with the wcwidth functions) until they exceed the width of the screen. But in a line that contains zero width joiners (or emoji qualifiers or etc), it could require much more than 2 * the screen width. (And my projects do have a robust Unicode iterator, so anywhere they need the full actual content, they can indeed use iterative calls -- but for the most part they just want to know if a line is empty, i.e. contains only spaces.) I recognize that it's problematic to fit Unicode into the console APIs whose original signatures aren't detailed enough to accommodate surrogates and joiners (or anything else beyond UCS2). I really like the balanced compromises that the API has made for that over the years. For example, a caller can iteratively call and analyze the returned output and piece together an accurate representation over multiple calls (or a single extra-large calls and then parse the results), regardless of how many WCHARs are required to fetch a full screen line of text. That's pretty cool, albeit complex, and gives callers the ability to operate to whatever degree of "correctness" or "thoroughness" that they wish to. Thanks again for the speedy response and follow-up! |
I've also been wondering if we should just replace surrogate pairs with U+FFFD to make this behave more consistent. While I also agree with your point ("gives callers the ability to [...]"), the support for surrogate pairs makes this API different from the |
Have there been complaints about the existing behavior over the past 20+ years that it's behaved the current way? It would be a breaking change. Personally, I'm very happy about how it's been working for years. But since English is my primary language, surrogate pairs are an outlier for me and only appear when certain nerdfont icons or emoji are present. It might be useful to gather opinions from software developers for whom surrogate pairs are common in normal language text. If there are neither complaints nor feedback, then I would lean towards preserving compatibility with previous behavior. |
@lhecker also, if surrogate pairs would be stripped, then I assume the intent would also be to strip zero width joiners, and all combining marks, right? There's a lot of room for complications and impactful side effects. |
That's true, sort of. The console so far was (in pseudocode terms) a The reason your repro case worked at all is because that broken UTF16 implementation coincidentally counted the narrow cell with a surrogate pair as a wide cell and thus the iterator stepped over both the leading and trailing surrogate pair. In other words, before I "fixed" it, it actually thought that the row had 121 columns (as an example), even though the buffer is only 120 wide. This caused a ton of issues and regressions elsewhere. So, while it's definitely a breaking change by definition (= change in behavior causing failures in other components), I don't believe that the exact contents that can be read via these APIs are properly specced out yet and never were. Sure, But let me get to a much more concerning point...
This is exactly my biggest concern right now. I'm planning to add proper Unicode support with extended grapheme clusters in the near term. That is, support for ZWJs, VS15/16, combining marks, and so on. I'm rather worried that this will make Thoughts? |
@lhecker that all makes sense. I can confirm that for my own purposes, the proposed change to RCOCW() would not cause any negative side effects in any of my projects (not even for users in locales where surrogate pairs are common). And specifically, Clink will not encounter problems from such a change. Because none of my projects do anything more than look for empty lines, anymore (well, Clink can "scrape" the screen for text to use in completions, but it's ok for that to miss some things). Something to consider for the future:It would be fantastic to have some way to save/restore the terminal screen (or a region within the terminal screen), and preserve arbitrarily complex text and colors and styles. I don't care whether it's an escape code or an API, and it would be perfectly fine for the saved state to be opaque (e.g. if an API filled an out-param buffer with some opaque payload that only the API knows how to unpack and restore -- though of course someone would try to "crack" the payload, and the API would need to guard against fuzzing, etc). The Alternate Screen Buffer private mode codes (e.g. FWIW, I stopped using ReadConsoleOutputW once it couldn't preserve colors and styles beyond the 4-bit VGA colors. But, I strongly wish there were a way to save and restore either the whole screen or a region of the screen (including colors, styles, and full Unicode text). The problem with the guidance "apps should keep track of what they print" is that sometimes an app wants to show a temporary popup window in a region of the screen that was printed by some other app. PowerShell's progress bars is one example, the Azure CLI installer is another example (but maybe it's just another PowerShell progress bar, I'm not sure), and there are several places in my own projects where I want to save/restore a region of the screen, but no longer can. In the case of Clink, it scrolls the screen to make room for the popup at the bottom of the screen where it doesn't overlay anything else -- it's functional, but it yields an unpolished and less pleasant experience. |
For better or worse, the question about whether to use U+FFFD or not has been answered... I'll make it return U+FFFD for now as that was my plan for the |
throwaway thought before I leave for lunch: #10810 sounds a lot like what you want |
Yes @zadjii-msft that'd be a great way for save/restore to work (i.e. to use pages and copy back and forth between pages, which I think are essentially a collection of alternate screen buffers). It's currently NYI, though, right? And yes, Far is an excellent example of an app that relies on accurately saving/restore content printed by other apps. Any solution that's viable for Far would meet even all of the "stretch" goals that I've ever had for what I would need from a save/restore mechanism. |
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible. Closes #16892 * Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW` * Read succeeds with a single U+FFFD ✅ (cherry picked from commit 373faf0) Service-Card-Id: 92121912 Service-Version: 1.20
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates the number of columns that should be read. For single-column (narrow) surrogate pairs this previously clipped a trailing character of the returned string. In the major Unicode support update in #13626 surrogate pairs truly got stored as atomic units for the first time. This now meant that a 120 column read with such codepoints resulted in 121 characters. Other parts of conhost still assume UCS2 however, and so this results in the entire read failing. This fixes the issue by turning surrogate pairs into U+FFFD which makes it UCS2 compatible. Closes #16892 * Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW` * Read succeeds with a single U+FFFD ✅ (cherry picked from commit 373faf0) Service-Card-Id: 92129542 Service-Version: 1.19
Windows Terminal version
1.19.10573.0
Windows build number
10.0.19045.4046
Other Software
This is an API issue, and affects any software that calls
ReadConsoleOutputCharacterW
.The API issue only repros with Windows Terminal, not with legacy conhost (nor with ConEmu, ConsoleZ, etc).
For example clink was affected by this when also using eza or dirx (see here for details of how this was encountered during real world usage).
Steps to reproduce
When a line of text in the console screen buffer contains one or more surrogate pairs, then the behavior
ReadConsoleOutputCharacterW
API does not match the documented contract, and is different from the behavior with legacy conhost.The attached repro program demonstrates the behavior:
repro_surrogate_pairs_issue.zip
Repro:
WriteConsoleW
to print a line of text that includes one or more surrogate pairs.ReadConsoleOutputCharacterW
to read back the same line.Easy demonstration program:
Expected Behavior
ReadConsoleOutputCharacterW
should:lpNumberOfCharsRead
with the number of characters read (e.g. the width of the console).lpCharacter
with characters read from the console.The demo program should first write 4 lines, then verify that reading the lines back matches what was originally written.
Each line contains Unicode codepoints that correspond to certain nerdfonts icons.
Sample expected output:
Actual Behavior
Only in Windows Terminal (all versions; 1.19, 1.20, 1.20 canary):
lpNumberOfCharsRead
with 0.lpCharacter
.Problem 1: It reads nothing; it should read text successfully, the same as in legacy conhost.
Problem 2: It reports success; that's inaccurate since it failed to read the text that was present.
Sample actual output:
The text was updated successfully, but these errors were encountered: