grapheme cluster & unicode v13 support #3304
Comments
Thanks for pointing me to this ticket @jerch. I have actually been implementing grapheme cluster support in my terminal emulator. My motivation behind all the pain was to be able to properly render complex emoji in the terminal. Now I'd like to address some of your points above:
I think that's not a problem, because it is highly unlikely that a non-supporting client app will render complex Unicode anyway.
This is right, but in my opinion it should not affect grapheme cluster aware terminals. Historically, every character displayed in a terminal was usually 1 grid cell wide. Because of that, the terminals that recently started to support emoji did indeed render them over 2 grid cells, but did not advance the cursor by 2 cells, so as not to break existing applications. I think that wasn't that bad, at least we had emoji. But most terminals are not grapheme cluster aware (seriously, are there any?).
Now, grapheme clusters can be used to determine how many consecutive codepoints should be rendered as "one user perceived character". Characters on the terminal usually occupy one grid cell, and a few occupy 2 grid cells (wide characters such as emoji, kanji, ...). Implementing grapheme cluster segmentation in a terminal would help put the right "non-breakable" sequence of codepoints into one grid cell - just as the user would expect (it is one user perceived character when rendered).
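To make the "non-breakable sequence per cell" idea concrete, here is a minimal Python sketch. It is deliberately NOT the full UAX #29 algorithm: it only glues combining marks, ZWJ sequences, skin tone modifiers and regional indicator pairs onto the preceding codepoint. The names `split_clusters`, `is_extender` and `is_regional_indicator` are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_extender(ch: str) -> bool:
    """True if ch attaches to the previous codepoint in this simplified model."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, incl. VS15/VS16
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF  # emoji skin tone modifiers
    )


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_clusters(text: str) -> list:
    """Split text into approximate grapheme clusters (simplified, not UAX #29)."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            joins = (
                is_extender(ch)
                or prev.endswith(ZWJ)  # simplification: anything after a ZWJ joins
                or (
                    len(prev) == 1
                    and is_regional_indicator(prev)
                    and is_regional_indicator(ch)  # flags are pairs of indicators
                )
            )
            if joins:
                clusters[-1] += ch
                continue
        clusters.append(ch)
    return clusters


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man ZWJ woman ZWJ girl
clusters = split_clusters("a" + family + "e\u0301")
```

With this, the family emoji and the accented "e" each end up as a single entry that a terminal could store in one (possibly wide) cell. A real implementation would be driven by the Unicode grapheme break property tables instead of these hand-picked ranges.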
AFAIR, the first codepoint determines the East Asian width unless VS15/VS16 is involved, which explicitly forces the presentation to either text (VS15, narrow, 1 character cell) or emoji (VS16, wide, 2 character cells). Grapheme clusters do not all of a sudden make text look vertical (re "both directions"). I think not even mlterm is doing that. Even RTL is a hard (but apparently solved) problem (again: mlterm, gnome-terminal to some degree?). But I think neither should hinder properly interpreting grapheme cluster boundaries.
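The width rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (first codepoint decides, VS15/VS16 override); `cluster_width` is a hypothetical helper, the result depends on the Unicode version bundled with the Python interpreter, and zero-width or ambiguous-width handling is left out.

```python
import unicodedata

VS15 = "\ufe0e"  # VARIATION SELECTOR-15: text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: emoji presentation


def cluster_width(cluster: str) -> int:
    """Cell width of one grapheme cluster under the rule sketched above."""
    if not cluster:
        return 0
    if VS16 in cluster:
        return 2  # emoji presentation is forced: wide
    if VS15 in cluster:
        return 1  # text presentation is forced: narrow
    # Otherwise the first codepoint's East Asian Width decides (Wide/Fullwidth = 2).
    return 2 if unicodedata.east_asian_width(cluster[0]) in ("W", "F") else 1


heart_text = cluster_width("\u2764")         # bare HEAVY BLACK HEART
heart_emoji = cluster_width("\u2764\ufe0f")  # same codepoint followed by VS16
```

The point of the example is the last two lines: the same base codepoint occupies a different number of cells depending on the variation selector that follows it, which is only visible once the terminal looks at the whole cluster.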
I think what you are referring to is that, for example, the family emoji of course usually looks like a family, but some applications might render it as a sequence of emoji symbols. Or a colored person emoji might alternatively be rendered as the standard (I think: white) colored emoji with the color modification emoji symbol right next to it. The quote from the UTS 51, section 2.2 is as follows:
To me, that is not as bad as it sounds, unless one plans to intentionally implement a TE that falls under the above mentioned category of "not supported". Everyone should at least attempt to support it, and that should not hinder grapheme segmentation boundary determination in a TE (IMHO). I hope this is not too much of a wall of text.
p.s.: In case legacy applications are still of concern to some, we could agree on a feature detection number that client apps could query if they want to distinguish. The user could enable or disable that feature by default via config, and an SM/RM code could even be introduced to let apps decide whether or not grapheme segmentation should be obeyed. (Even though, personally, I'd highly advocate always respecting grapheme clusters.)
Haha it is quite the wall, nonetheless trying to get through.
That's true for clusters, but not so for single codepoint runwidths. There are several Unicode releases that changed the widths, which creates misalignments for any app trying to anticipate wcwidth (I think v8 to v9 was the worst upgrade in that regard). Btw emojis used to be just 1 cell in earlier releases, but with the pictograms most moved to 2 cells. And to make it worse - there are some emojis that still map to 1 cell in text representation, but 2 cells for the pictogram, lol.
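A toy Python illustration of that version mismatch: the two width functions below are NOT real Unicode tables, just hand-written stand-ins that capture one change (to my understanding the Emoticons block U+1F600..U+1F64F went from narrow to wide with Unicode 9). All names here are hypothetical.

```python
from typing import Callable, Dict


def width_unicode8(cp: int) -> int:
    """Toy table: only CJK unified ideographs are wide."""
    return 2 if 0x4E00 <= cp <= 0x9FFF else 1


def width_unicode9(cp: int) -> int:
    """Toy table: like the one above, but the Emoticons block is wide too."""
    if 0x1F600 <= cp <= 0x1F64F:
        return 2
    return width_unicode8(cp)


WIDTH_TABLES: Dict[str, Callable[[int], int]] = {
    "8": width_unicode8,
    "9": width_unicode9,
}


def columns(text: str, version: str) -> int:
    """Number of cells the text occupies according to one width table."""
    width = WIDTH_TABLES[version]
    return sum(width(ord(ch)) for ch in text)


def drift(text: str, app_version: str, terminal_version: str) -> int:
    """How many cells the terminal's cursor ends up away from the app's guess."""
    return columns(text, terminal_version) - columns(text, app_version)


line = "ok \U0001F600"
offset = drift(line, app_version="8", terminal_version="9")
```

Any non-zero `offset` means the app and the terminal disagree on where the cursor is after printing the line, which is exactly the misalignment described above.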
Yes I think putting them in one cell is the only right way to deal with them, even if they are longer. But this raises a few questions about cell accounting in TEs, whether the emulator can mark a cell spanning multiple half width cells, and how to deal with those "super cells" during reflow and render. To fix TEs in this regard is imho the hard part of a halfway decent implementation.
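One possible way to do that cell accounting, sketched in Python under stated assumptions: a wide cluster lives in a lead cell of width 2 followed by a zero-width continuation cell, and reflow never lets such a pair straddle a row boundary. `Cell` and `layout` are made-up names, and the sketch assumes widths of 1 or 2 and at least 2 columns. It is not how xterm.js stores its buffer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cell:
    text: str = ""  # whole cluster in the lead cell, empty in a continuation cell
    width: int = 1  # 1 or 2 for a lead cell, 0 for a continuation cell


def layout(clusters: List[Tuple[str, int]], cols: int) -> List[List[Cell]]:
    """Place (cluster, width) pairs into rows of `cols` cells.

    Assumes every width is 1 or 2 and cols >= 2.
    """
    rows: List[List[Cell]] = [[]]
    for text, width in clusters:
        row = rows[-1]
        if len(row) + width > cols:
            # Pad with blanks so a wide cluster is never split across rows.
            pad = cols - len(row)
            row.extend(Cell(" ", 1) for _ in range(pad))
            row = []
            rows.append(row)
        row.append(Cell(text, width))
        # Reserve the extra cell(s) the cluster spans.
        row.extend(Cell("", 0) for _ in range(width - 1))
    return rows


content = [("a", 1), ("b", 1), ("\u6f22", 2)]
narrow = layout(content, cols=3)  # the wide cluster wraps to the next row
wide = layout(content, cols=4)    # everything fits on one row
```

Reflow then amounts to running `layout` again with the new column count on the stored clusters, and the renderer can skip every cell whose width is 0.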
Well, that "breaking the alignment in both directions" is a thing in browser font renderers, as I experienced when doing the original grapheme PR. If you happen to render glyphs on your own, you can always align things as you please (up to totally unreadable, because 20 codepoints just got painted into one text cell). In xterm.js we are somewhat limited in that regard and have to go with what the renderer offers us for combining codepoints. Have not done any BiDi stuff yet.
In xterm.js we are bound to what the browser/system offers us (browser engine, font renderer, installed fonts). There is no way to enforce the compound glyph; the combination of those outer dependencies can either show it or not. For the family emoji this gets really funny. And what shall appside do with that? The problem is bound to the renderer stage, and appside has no knowledge about that. Neither do multiplexers. That's the reason why I think we might need a lookup sequence - to give apps/multiplexers a way to do correct
About segmentation algo speed:
Ah I am not a big fan of that mixed legacy mode, as it just messes up unicode feature support. Imho if a TE claims v13 support, it should also handle graphemes, as they are quite often (Guess we had at least 20 emoji issues, and still chasing the unicode rabbit). Well created another text wall 😸
Better be tracked by #2668.
Issue to track grapheme cluster and follow-up unicode version support.
Currently we are not grapheme cluster aware which leads to several issues around complex combining unicode characters like compound emojis and scripting systems. Final goal should be to handle most newer aspects of Unicode 11+ with a dedicated v13 addon.
TODO
wcwidth
handling in InputHandler.print
with grapheme ruleset
bonus goals
Limitations
While support for graphemes and v13 will solve several output issues for compound characters on newer OS, it certainly will not solve all unicode related issues:
useCompoundUnicodeRenderWidth = True) to tell the renderer always to use the real available glyph width. The runwidth ambiguity on the app side could be solved with an additional terminal sequence requesting the real runwidth prior to usage (e.g. a chat app that knows the compound emojis it uses can request all runwidths before entering the main texting loop).
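On the app side, the "request all runwidths up front" idea could look roughly like the Python sketch below. No such terminal sequence is defined in this issue, so the round-trip to the terminal is represented by a plain callback. `RunWidthCache`, `preload` and `fake_terminal` are hypothetical names.

```python
from typing import Callable, Dict, Iterable


class RunWidthCache:
    """Caches per-cluster widths obtained from the terminal.

    `query` stands in for the (not yet specified) terminal round-trip that
    reports the real rendered width of a cluster.
    """

    def __init__(self, query: Callable[[str], int]) -> None:
        self._query = query
        self._widths: Dict[str, int] = {}

    def preload(self, clusters: Iterable[str]) -> None:
        """Ask for all known clusters once, before the main loop starts."""
        for cluster in clusters:
            if cluster not in self._widths:
                self._widths[cluster] = self._query(cluster)

    def width(self, cluster: str) -> int:
        """Return the cached width, querying lazily for unknown clusters."""
        if cluster not in self._widths:
            self._widths[cluster] = self._query(cluster)
        return self._widths[cluster]


calls = []


def fake_terminal(cluster: str) -> int:
    """Stand-in for a terminal that could not compose the cluster into one glyph."""
    calls.append(cluster)
    return 2 * len([ch for ch in cluster if ch != "\u200d"])


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
cache = RunWidthCache(fake_terminal)
cache.preload([family])
family_width = cache.width(family)  # served from the cache, no second query
```

The fake terminal here reports 6 cells for the family emoji, mimicking a renderer that falls back to three separate wide glyphs. With the cache in place the app can lay out its text against the width the terminal actually uses instead of guessing.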