tcell breaks grapheme clusters #264

rivo · 2019-03-14T08:19:53Z

The following code crashes tcell:

package main

import "github.com/gdamore/tcell"

func main() {
	text := []rune("🇩🇪")
	screen, _ := tcell.NewScreen()
	screen.Init()
	screen.SetContent(0, 0, text[0], text[1:], tcell.StyleDefault.Foreground(tcell.ColorWhite))
	screen.Sync()
	screen.PollEvent()
	screen.Fini()
}

Panic trace:

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/gdamore/tcell.(*CellBuffer).SetContent(0xc000094360, 0x0, 0x0, 0x1f1e9, 0xc0000b2774, 0x1, 0x1, 0x400000f00000000)
/github.com/gdamore/tcell/cell.go:61 +0x384
github.com/gdamore/tcell.(*tScreen).SetContent(0xc000094340, 0x0, 0x0, 0x1f1e9, 0xc0000b2774, 0x1, 0x1, 0x400000f00000000)
/github.com/gdamore/tcell/tscreen.go:435 +0xba
main.main()
/test.go:9 +0xbc
exit status 2

I've spent the last few days on this topic and realized we (tview, tcell, and go-runewidth) have all been rolling our own Unicode code point combination/splitting. Mostly by handling Modifiers (e.g. "a" (0x61) + "◌̈" (0x308) = "ä") and recently (#233) adding some Zero-Width Joiner (ZWJ) support.

It turns out that the topic is much more sophisticated and Unicode defines specific rules on what constitutes a character (the so-called "grapheme cluster"). They can be found in Unicode Standard Annex 29, specifically Section 3.1.1. (Modifiers and ZWJs are just one of the 14 rules).

The document also describes how to deal with flags, or "regional indicator symbols", as in the code example above.

I've published a new package https://github.com/rivo/uniseg which implements these rules so we can now split any string into its characters according to Annex 29.

Regarding tcell and the issue above, I actually think it doesn't need to handle any of this. It should be up to the caller to determine what is a character and pass those code points on to tcell. The SetContent() function should, in my opinion, not drop any runes from the combining characters slice. (And runewidth.RuneWidth(r) != 0 is not a good indicator for "invalid code points" anyway, as we saw in the panic above.) If I want to print "🏳️‍🌈", I send its code points 0x1f3f3 0xfe0f 0x200d 0x1f308 to tcell and those should be written to the terminal, no need to handle the Modifier or ZWJ in any special way. (And I know the macOS terminal, and likely others, too, will render the flag correctly.)

So my request would be to remove lines 52-65 from cell.go. I'm not sure if "combining characters" are dealt with specifically in other places of tcell. If so, those should also be reviewed, I think.

Please let me know what you think about this.

The text was updated successfully, but these errors were encountered:

gdamore · 2019-03-15T07:49:10Z

So the challenge is that tcell needs to be able to take an array of runes, and determine where the boundary for the grapheme is, and also the width in display cells of that grapheme. (Frankly, Unicode is making this all nuts, by going crazy adding new encodings and new combining schemes. I mostly blame emoji although some natural languages do make this complex.

I believe that what should happen here is that the application should supply graphemes, with the
first rune as the primary character, and any additional runes in the combc. If I had to do this over again I'd probably use a variadic interface here instead.

Now even given those, I need to determine whether the grapheme occupies one cell, or two, or (in some cases) none. This lets me figure out what I'm actually going to emit, and determine how many columns are used.

If you could figure out how to combine the width calculation with the grapheme bits (e.g. a Graphemes.DisplayWidth() API), that would be really really helpful, I think. I suspect you could probably add that to the table property you already have.

DisplayWidth needs to cope with the fact that some character widths are "ambiguous", so probably it should return values for "narrow", "wide", and "ambiguous". Also "none" for zero width runes like control characters.

gdamore · 2019-03-15T07:56:12Z

As I think more about this and review the code, I think you're right. The logic there was intended to strip out bogus characters (non-combining) as those would normally be application errors, but I think this was a mistake in retrospect.

I suspect tcell doesn't even need to segment at all... but the views package (which displays strings) would need this via your uniseg package. We can probably get away with using the first rune to determine the width.

gdamore · 2019-03-15T08:24:39Z

See this for details on width: http://www.unicode.org/reports/tr11/

rivo · 2019-03-19T11:18:41Z

Thanks a lot!

Regarding the width of grapheme clusters (see also mattn/go-runewidth#28), I think what we want instead of (or in addition to) runewidth.RuneWidth() is something like runewidth.GraphemeClusterWidth() or make runewidth.StringWidth() use my new package. I don't know what the correct logic for this would be. You suggested using the first rune but there are cases where runewidth.RuneWidth() will return 0 for the first character:

fmt.Printf("%d\n", runewidth.RuneWidth(0x070f)) // Outputs "0".

(0x070f falls into the "Prepend" category which is allowed as a first character.)

My best guess at this point is to use the width of the first non-zero-width rune for the entire grapheme cluster. I could be wrong but I guess I'll try it for now in tview as runewidth.StringWidth() definitely returns the wrong widths at the moment (from what I can see, it only adds up individual runes' widths).

I don't know how much time @mattn has at the moment to work on this. I've added my own implementation to tview, you can find it here.

rivo mentioned this issue Mar 14, 2019

Regional Indicators (Flags) and Grapheme Clusters mattn/go-runewidth#28

Closed

gdamore changed the title ~~tcell fails on regional flags (or: DIY Unicode grapheme cluster breaking)~~ tcell breaks grapheme clusters Mar 19, 2019

gdamore closed this as completed in d195cc7 Mar 19, 2019

rivo mentioned this issue Mar 19, 2019

Combined emojis don't render correctly rivo/tview#236

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tcell breaks grapheme clusters #264

tcell breaks grapheme clusters #264

rivo commented Mar 14, 2019

gdamore commented Mar 15, 2019

gdamore commented Mar 15, 2019

gdamore commented Mar 15, 2019

rivo commented Mar 19, 2019

tcell breaks grapheme clusters #264

tcell breaks grapheme clusters #264

Comments

rivo commented Mar 14, 2019

gdamore commented Mar 15, 2019

gdamore commented Mar 15, 2019

gdamore commented Mar 15, 2019

rivo commented Mar 19, 2019