-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] Text Buffer rewrite #8000
Comments
Alright, I've claimed this issue for the text buffer updates. |
TAB characters in #7810 could also be stored this way. |
@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working. |
Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either... When the new damage and column-to-region mapping algorithms, this may be a breeze. |
Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs. Changing TextBuffer COORD references to til::point mean a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files). |
@DHowett I don't think this is as simple as you think. Lets say you write out the string Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag ( And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible. |
@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. is more impossible with what we have than this specification. I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them. All that, though, is not in scope right now. Incremental progress 😄 while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today are my goal. |
@naikel breaking any dependency we have on COORD internally is a significant bonus. |
@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch. EDIT: I really hope you can do this! |
Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs). struct IMeasurer {
// Measures one codepoint, stored as UTF-16 code units
virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;
// Measures multiple codepoints, stored as a string of UTF-16 code units.
// This function should (?) return 0-width code units for combining characters
// if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
} struct UnicodeUCDMeasurer : public IMeasurer {
// implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
} struct RendererFallbackMeasurer : public IMeasurer {
IMeasurer* rootMeasurer;
RenderEngine* renderEngine;
// IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
// This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
} |
Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners. Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly. This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse. |
Renzhi is a MS employee like me who is moonlighting this project. He is an expert on font, Unicode and layout. He has been working with us internally to improve the overall Unicode support by providing an algorithm for general text width measurement.
获取 Outlook for iOS<https://aka.ms/o0ukef>
|
OK, thanks for info. As long as you guys are confident that this is going work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem. |
since we said his name 3 times: @reli-msft |
… Emoji support (#15567) This is a complete rewrite of the old `WriteCharsLegacy` function which is used when VT mode is disabled as well as for all interactive console input handling on Windows. The previous code was almost horrifying in some aspects as it first wrote the incoming text into a local buffer, stripping/replacing any control characters. That's not particular fast and never was. It's unknown why it was like that. It also measured the width of each glyph to correctly determine the cursor position and line wrapping. Presumably this used to work quite well in the original console code, because it would then just copy that local buffer into the destination text buffer, but with the introduction of the broken and extremely slow `OutputCellIterator` abstraction this would end up measuring all text twice and cause disagreements between `WriteCharsLegacy`'s idea of the cursor position and `OutputCellIterator`'s cursor position. Emoji input was basically entirely broken. This PR fixes it by passing any incoming text straight to the `TextBuffer` as well as by using its cursor positioning facilities to correctly implement wrapping and backspace handling. Backspacing over Emojis and an array of other aspects still don't work correctly thanks to cmdline.cpp, but it works quite a lot better now. Related to #8000 Closes #8839 Closes #10808 ## Validation Steps Performed * Printing various Unicode text ✅ * On an fgets() input line * Typing text works ✅ * Inserting text works anywhere ✅ * Ctrl+X is translated to ^X ✅ * Null is translated to ^@ ✅ This was tested by hardcoding the `OutputMode` to 3 instead of 7. * Backspace only advances to start of the input ✅ * Backspace deletes the entire preceding tab ✅ * Backspace doesn't delete whitespace preceding a tab ✅ * Backspacing a force-wrapped wide glyph unwraps the line break ✅ * Backspacing ^X deletes both glyphs ✅ * Backspacing a force-wrapped tab deletes trailing whitespace ✅ * When executing ```cpp fputs("foo: ", stdout); fgets(buffer, stdin); ``` pressing tab and then backspace does not delete the whitespace that follows after the "foo:" string (= `sOriginalXPosition`).
`COOKED_READ_DATA` is a little special and requires cursor navigation based on the raw (buffered) text contents instead of what's in the text buffer. This requires the introduction of new helper functions to implement such cursor navigation. They're made part of `TextBuffer` as these helpers will get support graphemes in the future. It also helps keeping it close to `TextBuffer` as the cursor navigation should optimally behave identical between the two. Part of #8000.
The ultimate goal of this PR was to use ICU for text search to * Improve Unicode support Previously we used `towlower` and only supported BMP glphs. * Improve search performance (10-100x) This allows us to search for all results in the entire text buffer at once without having to do so asynchronously. Unfortunately, this required some significant changes too: * ICU's search facilities operate on text positions which we need to be mapped back to buffer coordinates. This required the introduction of `CharToColumnMapper` to implement sort of a reverse-`_charOffsets` mapping. It turns text (character) positions back into coordinates. * Previously search restarted every time you clicked the search button. It used the current selection as the starting position for the new search. But since ICU's `uregex` cannot search backwards we're required to accumulate all results in a vector first and so we need to cache that vector in between searches. * We need to know when the cached vector became invalid and so we have to track any changes made to `TextBuffer`. The way this commit solves it is by splitting `GetRowByOffset` into `GetRowByOffset` for `const ROW` access and `GetMutableRowByOffset` which increments a mutation counter on each call. The `Search` instance can then compare its cached mutation count against the previous mutation count. Finally, this commit makes 2 semi-unrelated changes: * URL search now also uses ICU, since it's closely related to regular text search anyways. This significantly improves performance at large window sizes. * A few minor issues in `UiaTracing` were fixed. In particular 2 functions which passed strings as `wstring` by copy are now using `wstring_view` and `TraceLoggingCountedWideString`. Related to #6319 and #8000 ## Validation Steps Performed * Search upward/downward in conhost ✅ * Search upward/downward in WT ✅ * Searching for any of ß, ẞ, ss or SS matches any of the other ✅ * Searching for any of Σ, σ, or ς matches any of the other ✅
Subjectively speaking, this commit makes 3 improvements: * Most importantly, it now would work with arbitrary Unicode text. (No more `IsGlyphFullWidth` or DBCS handling during reflow.) * Due to the simpler implementation it hopefully makes review of future changes and maintenance simpler. (~3x less LOC.) * It improves perf. by 1-2 orders of magnitude. (At 120x9001 with a full buffer I get 60ms -> 2ms.) Unfortunately, I'm not confident that the new code replicates the old code exactly, because I failed to understand it. During development I simply tried to match its behavior with what I think reflow should do. Closes #797 Closes #3088 Closes #4968 Closes #6546 Closes #6901 Closes #15964 Closes MSFT:19446208 Related to #5800 and #8000 ## Validation Steps Performed * Unit tests ✅ * Feature tests ✅ * Reflow with a scrollback ✅ * Reflowing the cursor cell causes a forced line-wrap ✅ (Even at the end of the buffer. ✅) * `color 8f` and reflowing retains the background color ✅ * Enter alt buffer, Resize window, Exit alt buffer ✅
Subjectively speaking, this commit makes 3 improvements: * Most importantly, it now would work with arbitrary Unicode text. (No more `IsGlyphFullWidth` or DBCS handling during reflow.) * Due to the simpler implementation it hopefully makes review of future changes and maintenance simpler. (~3x less LOC.) * It improves perf. by 1-2 orders of magnitude. (At 120x9001 with a full buffer I get 60ms -> 2ms.) Unfortunately, I'm not confident that the new code replicates the old code exactly, because I failed to understand it. During development I simply tried to match its behavior with what I think reflow should do. Closes #797 Closes #3088 Closes #4968 Closes #6546 Closes #6901 Closes #15964 Closes MSFT:19446208 Related to #5800 and #8000 ## Validation Steps Performed * Unit tests ✅ * Feature tests ✅ * Reflow with a scrollback ✅ * Reflowing the cursor cell causes a forced line-wrap ✅ (Even at the end of the buffer. ✅) * `color 8f` and reflowing retains the background color ✅ * Enter alt buffer, Resize window, Exit alt buffer ✅ (cherry picked from commit 7474839) Service-Card-Id: 90642727 Service-Version: 1.19
can someone update the issue description? #30 seems to be the only remaining open issue at this point. |
With only a few places left that rely on the old grid-based buffer layout, all of which require some long term, but focused work, I'll be going ahead and close this more general issue. |
This is the issue tracking the great buffer rewrite of 202x.
Aims
UnicodeStorage
(which is a lookup table keyed on row+column)ROW::_id
,ROW::_pParent
,CharRow::_pParent
CharRow
,CharRowCell
,CharRowCellReference
Notes
Surrogate Pairs
Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.
I have done a small amount of work in WriteCharsLegacy. It is slow going.
Motivation
#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.
Implementation
The buffer is currently stored as a vector (small_vector) of
CharRowCell
, each of which contains aDbcsAttribute
and awchar_t
. Each cell takes 3 bytes (plus padding, if required.)In the common case (all narrow text), this is terribly wasteful.
To better support codepoints requiring one or more code units representing a character, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of
til::rle
(#8741).A column count of zero indicates a code unit that is a continuation of an existing glyph.
Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.
Work Log
til::rle<T, S>
- a run length encoded storage template, which we will use to store column countsOther issues that might just be fixed by this
TextBuffer::Reflow
performs poorly for large buffers #4968The text was updated successfully, but these errors were encountered: