-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement grapheme clusters #16916
Implement grapheme clusters #16916
Conversation
5cb95f7
to
de50835
Compare
de50835
to
62b2cdb
Compare
fmt.Println(err) | ||
os.Exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
fmt.Println(err) | |
os.Exit(1) | |
log.Fatalln(err) |
{ | ||
char32_t codepoint = 0; | ||
const auto l = lead & 15; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: personally i prefer masking constants to be hex, but if you prefer decimal that is OK too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I'm using 15
because that way I find it easier to see that it makes indexing into a [16]
element large array safe.
@j4james If you ever get a chance to test this branch, I'd love to hear your thoughts on it! 🙂 In particular, Dustin and me have been talking about whether it's necessary for us to allow users to disable grapheme support. It would definitely be beneficial if it were (restores compatibility with conhost v2) but also disadvantageous (limits our freedom in designing the text storage). |
First off, I have to say this is very cool. However, I do think it needs to be disabled by default, because the majority of Linux apps will expect terminals to be compatible with As a simple example, trying pasting something like this into a WSL bash shell and then arrow back to the start of the line:
The first thing you'll notice is an extra flag getting rendered at the end of the line, and then when you arrow back, the cursor ends up part way through the prompt. Those flags consist of a flag emoji, a zero width joiner, and a rainbow emoji. The flag has a width of 1, ZWJ a width of 0, and the rainbow a width of 2, so most apps will expect the combination to be 3. But since we're treating it as a width of 2, we're out of sync with what bash is expecting, and that's how things end up breaking. The good news, though, is that there's a mode we can support ( I don't know the details well enough to know whether your implementation is compatible with what is required for mode |
One other thing to bear in mind is how this will work with third-party terminals. Assumedly for conpty to work correctly, the conhost width calculations and the terminal width calculations need to be in sync. So there needs to be a way for the terminal to tell conhost what algorithm it is using, or possibly what algorithms it can support (if, like Contour, the mode is switchable). Even ignoring the third party terminals, this would assumedly be necessary if you want to make it a user-controllable option in the Windows Terminal settings. I guess this is another use case for the conpty ioctl thing. |
I'm super happy to hear you feel this way! 😌 I apologize for the very long text below. I tried to delete as much as I could, but... oh well:
I don't wish to do this, but I am open to changing my mind. In short, I believe that the default mode of operation should be "forward looking" / progressive, with the optimistic mindset that we'll have more decades of terminal usage in front of us than behind us. I'm otherwise afraid that it puts less pressure on the general terminal landscape to be progressive. As far as I know the other modern terminals that support graphemes have the same stance.
(FYI your flag emoji is missing U+FE0F which is why it doesn't properly render. This works: 🏳️🌈🏳️🌈🏳️🌈.) I've tested this in Contour, WezTerm, kitty, foot, and Konsole and in all of them this failed to work in bash out of the box. I'm not sure whether there's something I need to configure to make it work in any of those terminals, but I believe many of them don't make this configurable. However, I believe this also somewhat validates my position on the choice of the default behavior above. In my personal opinion this is fine though, because most text is in the BMP and in its NFC form (= no combining marks). With such inputs, bash and other
Yes, I'm definitely intending to implement However, I don't believe Contour allows grapheme support to be turned off via this sequence. Contour calls it Edit: I found a support table! https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison
FWIW this is why I strongly believe that a conhost.lib is the right path forward. Even with an ioctl thingy, we'd not be able to ship every variant that people would need. fish-shell's widecharwidth for instance assigns unassigned and noncharacter codepoints a width of 0, but many other implementations, including the wcwidth implementations I'm aware of, as well as Windows Terminal so far, assign them a width corresponding to the EastAsian attribute. So even if we had a toggle to disable grapheme support, it would still have broken edge cases. With a in-process .lib it would at least work consistently correct according to the hosting terminal, as long as one doesn't use SSH (which requires something like the current ConPTY architecture). All of the above aside, I'm not against adding such a toggle per se. I'm just worried it'll either:
If you have any arguments in favor of adding a toggle, or know about any applications that this change would significantly regress, I'd be 110% open to add it. In fact the mere circumstance that you argued in favor it already made me heavily consider it. 😅 I hope the above gives a few good arguments though, as to why we may want to consider not having such a toggle. |
Out of curiosity I've tried to implement a If you have any tips on how to implement a wcwidth-like mode, I'll try to do so. |
Are you hoping that making bash not work properly in Windows Terminal will force them to change, and instead break in terminals like XTerm and Gnome Terminal? Because that seems very unlikely. And I can understand the "little guys" like Contour or WezTerm choosing to do their own thing - nobody is going to give a damn - but if Microsoft tries to use its "monopoly" position to force through breaking changes in Linux, that's probably not going to go down well. Now if XTerm and VTE were in on this too, I think it would be fine. But I can't imagine XTerm breaking decades of backwards compatibility, and recent discussions on the VTE issue tracker gave me the impression that they weren't keen on breaking things either (which surprised me actually). It's possible their views might change at some point, but until there is broader consensus amongst the main terminals, my recommendation would be to avoid deliberately breaking things. And just to be clear, I personally don't care about the defaults either way. I just don't want Windows Terminal to be seen as another one of Microsoft's EEE attempts.
I haven't been following this development very closely, but I suspect that is just because they haven't fully implemented the spec yet. The commit message "Starts adapting Terminal Unicode Core's DEC mode 2027" suggests it's still a work in progress. Certainly the spec is very clear about backwards compatibility: "Everything is disabled by default, so legacy apps don’t break more than they used to break already." That was the whole point of introducing a mode for this in the first place. If you don't care about backwards compatibility, there's really no need for this mode.
Well if you decide not be backwards compatible by default, then when you get a bug report about something like bash breaking, you can at least tell people there's an option they can toggle to fix it, rather than saying you're deliberately breaking apps to force them to change.
I thought it would just be a matter of dropping any zero width characters, and treating the other characters as either single or double width, as specified in Looking a the Then as you backspace, you can see it send Btw if you want to test with a wcwidth-compatible terminal on Windows, this works as expected in mintty (assuming you're using the wsltty version). |
This is going to seem like a stupid edge case, but I think it might be an indication that something is not quite right in the architecture. Try this in a WSL shell: printf "\U01f3f3\ufe0f\e7 \U01f308 DON'T MOVE"; sleep 1; printf "\e8\u200d\n" Initially it displays a flag and a rainbow with a space between them, but after a second they join together. This shrinks the combination down to two cells, but the "DON'T MOVE" text remains where it started. Now if you try the same thing without the sleep, the "DON'T MOVE" text appears three columns to the left. I'm assuming that's because the join happens before the conpty renderer has had a chance to refresh. That seems wrong to me. It should not be possible for text to end up in a different position just based on timing. And I think the underlying issue is that it shouldn't be possible to "join" two characters which have been output separately, just by inserting a |
I haven't fully debugged it yet, but I believe this shows two highly related issues:
We can solve both issues in two ways:
Personally speaking, I highly prefer the latter as it makes the "what you see" when we render on the screen consistent with "what you get" when you copy the text to your clipboard. It also simplifies our implementation, because the former option requires us to implement the workaround both, in ConPTY and in our various text renderers, whereas the latter is a simple if condition in Thoughts? |
That makes sense. And I don't think that's urgent to fix at this stage. The important thing is getting the width calculations correct first. The rendering issues can be dealt with later.
I think this would be my preference. My expectation was that these combining characters would be interpreted at the time the stream of text was received, rather than when rendering the buffer, and this seems like a neat way to accomplish that.
I don't think that's a big deal though. Because the minute you've moved the cursor, you're effectively in an edit mode - you're now changing what's in the buffer. You can't expect to read back text that've you overwritten. It's like writing |
This comment was marked as off-topic.
This comment was marked as off-topic.
The difference is that the current behavior is viewed as a bug. When someone complains about the widths being off somewhere, we point them to the ZWJ issue and say we're still working on it.
I apologize if I've misinterpreted your wording, and if there are teams, I am definitely on your side. I love what you're doing here, but I want to avoid giving ammunition to the groups that are looking for an excuse to attack MS.
If we are going to have this functionality enabled by default, then I think a settings toggle would be preferable to mode ?2027, just because it's more user-friendly. And I'm not sure it's even worth bothering with the mode if everyone is just going to default it on - that seems pointless to me That said, if we don't yet have an easy way to implement a toggle, then a mode is definitely better than nothing. I just want to make sure we have something we can tell people if anyone does complain. And again I must apologise if my earlier message has caused you a lot of stress. That's precisely what I was trying to avoid, and I just did a very poor job of it. |
And you know what, if the wcwidth-compatible algorithm is not a simple fix, maybe we should just leave it for a followup issue. If nobody complains, we can ignore it indefinitely. And if anyone does complain, we can at least say it's a known issue, and gather some real-world test cases before trying to address it. I'd hate for you to give up on this PR just because I'm being paranoid about hypothetical complaints. |
I was happy when I saw that you began implementing text segmentation into grapheme clusters. I think this is the right step in improving the console subsystem, despite the fact that this may go against the existing code base and de facto “standards” in the terminal world, up to changing the terminal architecture (internal text buffer) and rewriting a large amount of functionality literally from scratch. A console application may have its own interpretation of the Unicode standard, burdened by national peculiarities that the terminal cannot predict in any way, and it is necessary to somehow allow the console application to set the boundaries of grapheme clusters itself. If the application does not specify the boundaries of grapheme clusters and does not even make hints about this, then the terminal has the right to set the boundaries of the clusters as it pleases, either codepoint-wise or grapheme-wise. The need for this becomes obvious in any attempt to create a more or less competitive console text editor or a panel file manager like TotalCommander, in which there is no point in rendering text in codepoint, since there it is necessary to display grapheme clusters in the same way as GUI editors/file managers display them . The point here is not only in the plain text of the edited document, which can still be somehow perceived in a codepoint manner, but in the file names that the user sees in GUI mode exclusively in grapheme representation. This applies to both the Unix world and the Windows world. The fact that in the Unix world in terminals, text is displayed in codepoint, in my opinion this is a historical flaw associated with technical limitations, and sooner or later this flaw will be fixed. And it would be an even bigger mistake to rely on this codepoint behavior when developing a terminal emulator or console application. Edit: |
/cc @apparebit 😄 |
There are 3 errors:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i should probably block over #17364
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
51/54
} | ||
|
||
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), flags, &_inPipe, &_outPipe, &_hPC)); | ||
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), _flags, &_inPipe, &_outPipe, &_hPC)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah uh defterm handoff will always use the "default" value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this is definitely rough.
- We could add it as another property on
TERMINAL_STARTUP_INFO
though, we weren't smart enough to add adwSize
, so we'd have to rev theITerminalHandoff2
, s.t. we can make sure that an old conhost- Wait, no, we don't. TerminalHandoff is from a OpenConsole to a Terminal in the same package. We can just bump that.
- We could do the
closeOnExit: graceful
thing, where we know that because it's a defterm, it's going to always use the version from the registry- we'd need to cache that value when it's created
- Wait, would it use the value from the registry? Or fall into the category of "because conpty, we didn't even load
HKCU/Console
"
default: | ||
break; | ||
} | ||
CodepointWidthDetector::Singleton().Reset(mode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh double f-ck, conpty and terminal will disagree on the measurement mode during defterm handoff!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! It's intentionally broken at the moment until I sort out IsGlyphFullWidth
.
@@ -213,6 +222,7 @@ class Settings | |||
std::wstring _LaunchFaceName; | |||
bool _fAllowAltF4Close; | |||
DWORD _dwVirtTermLevel; | |||
SettingsTextMeasurementMode _textMeasurement = SettingsTextMeasurementMode::Graphemes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider having a different default for inbox (in 3 years?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea to me. I'm sure in the 5 years before we get this inbox, we'll have any long tail of bugs sorted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion: I consider it unlikely that this will break any applications. After all, what use would printing a zero-width character have in conhost at the moment? In particular if we consider for how long our ExtTextOutW
support was broken... I think this will be a non-issue.
If it does become an issue, we do have the registry key someone could set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
49/54 but i see you're reviewing feedback now so here's basically me +1'ing dustin's thoughts
#define PSEUDOCONSOLE_GLYPH_WIDTH__MASK 0x18 | ||
#define PSEUDOCONSOLE_GLYPH_WIDTH_GRAPHEMES 0x08 | ||
#define PSEUDOCONSOLE_GLYPH_WIDTH_WCSWIDTH 0x10 | ||
#define PSEUDOCONSOLE_GLYPH_WIDTH_CONSOLE 0x18 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
📝: 0x18
here is "both the PSEUDOCONSOLE_GLYPH_WIDTH
bits", which means console. Clever way to pack three enum values into two flags
} | ||
|
||
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), flags, &_inPipe, &_outPipe, &_hPC)); | ||
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), _flags, &_inPipe, &_outPipe, &_hPC)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this is definitely rough.
- We could add it as another property on
TERMINAL_STARTUP_INFO
though, we weren't smart enough to add adwSize
, so we'd have to rev theITerminalHandoff2
, s.t. we can make sure that an old conhost- Wait, no, we don't. TerminalHandoff is from a OpenConsole to a Terminal in the same package. We can just bump that.
- We could do the
closeOnExit: graceful
thing, where we know that because it's a defterm, it's going to always use the version from the registry- we'd need to cache that value when it's created
- Wait, would it use the value from the registry? Or fall into the category of "because conpty, we didn't even load
HKCU/Console
"
@@ -213,6 +222,7 @@ class Settings | |||
std::wstring _LaunchFaceName; | |||
bool _fAllowAltF4Close; | |||
DWORD _dwVirtTermLevel; | |||
SettingsTextMeasurementMode _textMeasurement = SettingsTextMeasurementMode::Graphemes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea to me. I'm sure in the 5 years before we get this inbox, we'll have any long tail of bugs sorted
break; | ||
case DispatchTypes::ModeParams::XTERM_BracketedPasteMode: | ||
enabled = _api.GetSystemMode(ITerminalApi::Mode::BracketedPaste); | ||
state = _api.GetSystemMode(ITerminalApi::Mode::BracketedPaste) ? DispatchTypes::DECRPM_Enabled : DispatchTypes::DECRPM_Disabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Seems like we could just special case GCM_GraphemeClusterMode
at the bottom of the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I think I get it. Let's go wild.
src/types/CodepointWidthDetector.cpp
Outdated
|
||
switch (glyph.size()) | ||
// The _state is stored ~flipped, so that we can differentiate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sounds like a boolean with more steps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wanted the struct to be zero-initializable with minimal overhead so that the calls are cheap. Initializing the member to ~0 = 255 by default would break that, as would adding a bool member (adds padding = bad).
There are ways to make all of this categorically better but I don't think it's time for that yet. First we need to get rid of the IsGlyphWide and ROW::_charOffsets glue. :)
Afterwards we can make it so that we segment entire grapheme runs at a time (instead of 1 grapheme at a time) which amortizes the call overhead to 0 and then it doesn't matter anymore how the struct looks like.
First, this adds
GraphemeTableGen
whichucd.nounihan.grouped.xml
Next, this adds
GraphemeTestTableGen
whichGraphemeBreakTest.txt
CodepointWidthDetector.cpp
was rewritten from scratch toGraphemeState
) to maintain statewcswidth
-style, and a mode identical to the old conhostWith this in place the following changes were made:
ROW::WriteHelper::_replaceTextUnicode
now uses the newgrapheme cluster text iterators
contents of the current cell if they join to form a cluster
of the measurement mode over from WT's settings to ConPTY
This is part of #1472
Validation Steps Performed