tcell breaks grapheme clusters #264
Comments
So the challenge is that tcell needs to be able to take an array of runes and determine where the grapheme boundary is, and also the width of that grapheme in display cells. (Frankly, Unicode is making this all nuts by going crazy adding new encodings and new combining schemes. I mostly blame emoji, although some natural languages do make this complex.)

I believe that what should happen here is that the application should supply graphemes, with the ...

Now, even given those, I need to determine whether the grapheme occupies one cell, or two, or (in some cases) none. This lets me figure out what I'm actually going to emit, and determine how many columns are used.

If you could figure out how to combine the width calculation with the grapheme bits (e.g. a Graphemes.DisplayWidth() API), that would be really, really helpful, I think. I suspect you could probably add that to the table property you already have. DisplayWidth needs to cope with the fact that some character widths are "ambiguous", so it should probably return values for "narrow", "wide", and "ambiguous", plus "none" for zero-width runes like control characters.
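To make the suggested return values concrete, here is a minimal sketch of a four-way width classification. Everything in it is an assumption for illustration: the names WidthClass and ClassifyRune are hypothetical (they are not part of tcell, uniseg, or go-runewidth), and the range tables are deliberately tiny stand-ins for the real East Asian Width data from UAX #11.

```go
package main

import (
	"fmt"
	"unicode"
)

// WidthClass is a hypothetical enum for the four categories suggested above.
type WidthClass int

const (
	WidthNone      WidthClass = iota // zero-width: control, combining, format runes
	WidthNarrow                      // one cell
	WidthWide                        // two cells
	WidthAmbiguous                   // one or two cells, depending on the terminal
)

func (w WidthClass) String() string {
	switch w {
	case WidthNone:
		return "none"
	case WidthNarrow:
		return "narrow"
	case WidthWide:
		return "wide"
	default:
		return "ambiguous"
	}
}

// runeRange is an inclusive range of code points.
type runeRange struct{ lo, hi rune }

// Illustrative tables only. A real implementation would be generated
// from the Unicode East Asian Width data file.
var wideRanges = []runeRange{
	{0x4E00, 0x9FFF}, // CJK Unified Ideographs
	{0xAC00, 0xD7A3}, // Hangul Syllables
}

var ambiguousRanges = []runeRange{
	{0x03B1, 0x03C1}, // Greek small letters alpha..rho
}

func inRanges(r rune, ranges []runeRange) bool {
	for _, rr := range ranges {
		if r >= rr.lo && r <= rr.hi {
			return true
		}
	}
	return false
}

// ClassifyRune returns an approximate width class for a single rune.
func ClassifyRune(r rune) WidthClass {
	switch {
	case unicode.IsControl(r),
		unicode.Is(unicode.Mn, r), // nonspacing marks, e.g. U+0308
		unicode.Is(unicode.Me, r), // enclosing marks
		unicode.Is(unicode.Cf, r): // format characters, e.g. U+200D (ZWJ)
		return WidthNone
	case inRanges(r, wideRanges):
		return WidthWide
	case inRanges(r, ambiguousRanges):
		return WidthAmbiguous
	default:
		return WidthNarrow
	}
}

func main() {
	for _, r := range []rune{'a', 0x0308, 0x4E16, 0x03B1, 0x200D} {
		fmt.Printf("U+%04X: %v\n", r, ClassifyRune(r))
	}
}
```

Returning a class rather than a number leaves the "ambiguous" decision to the caller, which can map it to one or two cells based on what it knows about the terminal.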
As I think more about this and review the code, I think you're right. The logic there was intended to strip out bogus characters (non-combining) as those would normally be application errors, but I think this was a mistake in retrospect. I suspect tcell doesn't even need to segment at all... but the views package (which displays strings) would need this via your uniseg package. We can probably get away with using the first rune to determine the width.
See this for details on width: http://www.unicode.org/reports/tr11/
Thanks a lot! Regarding the width of grapheme clusters (see also mattn/go-runewidth#28), I think what we want instead of (or in addition to) fmt.Printf("%d\n", runewidth.RuneWidth(0x070f)) // Outputs "0". ( My best guess at this point is to use the width of the first non-zero-width rune for the entire grapheme cluster. I could be wrong but I guess I'll try it for now in I don't know how much time @mattn has at the moment to work on this. I've added my own implementation to
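The "first non-zero-width rune" heuristic mentioned above could look roughly like the sketch below. The function name clusterWidth is hypothetical, and stubWidth is a made-up stand-in for a per-rune width function such as runewidth.RuneWidth; its values are placeholders chosen only to exercise the logic, not real width data.

```go
package main

import "fmt"

// clusterWidth returns the width of the first rune in the cluster that has
// a non-zero width, or 0 if every rune is zero-width.
func clusterWidth(cluster []rune, runeWidth func(rune) int) int {
	for _, r := range cluster {
		if w := runeWidth(r); w > 0 {
			return w
		}
	}
	return 0
}

// stubWidth is a hypothetical per-rune width function with placeholder values.
func stubWidth(r rune) int {
	switch {
	case r == 0x200D, r == 0xFE0F, r >= 0x0300 && r <= 0x036F:
		return 0 // ZWJ, variation selector-16, combining diacritical marks
	case r >= 0x1F300 && r <= 0x1FAFF:
		return 2 // assumed wide for this sketch only
	default:
		return 1
	}
}

func main() {
	flag := []rune{0x1F3F3, 0xFE0F, 0x200D, 0x1F308}
	fmt.Println(clusterWidth(flag, stubWidth))
	fmt.Println(clusterWidth([]rune{'a', 0x0308}, stubWidth))
}
```

Passing the width function as a parameter keeps the heuristic independent of whichever width table ends up being used.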
The following code crashes tcell:

Panic trace:
I've spent the last few days on this topic and realized we (tview, tcell, and go-runewidth) have all been rolling our own Unicode code point combination/splitting, mostly by handling Modifiers (e.g. "a" (0x61) + "◌̈" (0x308) = "ä") and recently (#233) adding some Zero-Width Joiner (ZWJ) support.

It turns out that the topic is much more sophisticated and Unicode defines specific rules on what constitutes a character (the so-called "grapheme cluster"). They can be found in Unicode Standard Annex 29, specifically Section 3.1.1. (Modifiers and ZWJs are just one of the 14 rules.)
The document also describes how to deal with flags, or "regional indicator symbols", as in the code example above.
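As an illustration of the regional indicator case only, the sketch below pairs up consecutive regional indicator symbols (U+1F1E6 to U+1F1FF) so that each flag stays together. It is a simplification, not an implementation of Annex 29: the function name splitFlags is hypothetical, and every other rune is treated as its own piece, which the full rule set would not do.

```go
package main

import "fmt"

// isRegionalIndicator reports whether r is a regional indicator symbol.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// splitFlags splits s into pieces, keeping each pair of consecutive
// regional indicators together. All other runes become separate pieces,
// which is a deliberate simplification.
func splitFlags(s string) []string {
	var out []string
	runes := []rune(s)
	for i := 0; i < len(runes); i++ {
		if isRegionalIndicator(runes[i]) && i+1 < len(runes) && isRegionalIndicator(runes[i+1]) {
			out = append(out, string(runes[i:i+2]))
			i++ // skip the second half of the pair
			continue
		}
		out = append(out, string(runes[i]))
	}
	return out
}

func main() {
	// Regional indicators D+E followed by F+R: two flags, four code points.
	pieces := splitFlags("\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7")
	fmt.Println(len(pieces))
}
```

The pairing matters because a run of regional indicators has no other separator: breaking after every second one is what keeps adjacent flags from merging or splitting in the wrong place.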
I've published a new package https://github.com/rivo/uniseg which implements these rules so we can now split any string into its characters according to Annex 29.
Regarding tcell and the issue above, I actually think it doesn't need to handle any of this. It should be up to the caller to determine what is a character and pass those code points on to tcell. The SetContent() function should, in my opinion, not drop any runes from the combining characters slice. (And runewidth.RuneWidth(r) != 0 is not a good indicator for "invalid code points" anyway, as we saw in the panic above.) If I want to print "🏳️🌈", I send its code points 0x1f3f3 0xfe0f 0x200d 0x1f308 to tcell and those should be written to the terminal, no need to handle the Modifier or ZWJ in any special way. (And I know the macOS terminal, and likely others, too, will render the flag correctly.)

So my request would be to remove lines 52-65 from cell.go. I'm not sure if "combining characters" are dealt with specifically in other places of tcell. If so, those should also be reviewed, I think.

Please let me know what you think about this.
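To illustrate the requested behaviour, here is a minimal sketch of a cell that stores whatever runes the caller hands it, without filtering. The cell type and its methods are hypothetical stand-ins written for this example; they are not tcell's actual types or the code in cell.go.

```go
package main

import "fmt"

// cell is a hypothetical stand-in for one screen cell.
type cell struct {
	mainc rune   // first code point of the character
	combc []rune // remaining code points, stored exactly as supplied
}

// setContent stores the caller's runes as given. Nothing is dropped,
// regardless of what a width function would report for each rune.
func (c *cell) setContent(mainc rune, combc []rune) {
	c.mainc = mainc
	c.combc = append([]rune(nil), combc...) // copy so the caller's slice is not aliased
}

// text returns the code points that would be written to the terminal.
func (c *cell) text() string {
	return string(append([]rune{c.mainc}, c.combc...))
}

func main() {
	var c cell
	// The code points listed above: 0x1f3f3 0xfe0f 0x200d 0x1f308.
	c.setContent(0x1F3F3, []rune{0xFE0F, 0x200D, 0x1F308})
	fmt.Printf("%U\n", []rune(c.text()))
}
```

With this shape, deciding what counts as one character is entirely the caller's job, and the cell only has to pass the sequence through to the terminal.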