Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcell breaks grapheme clusters #264

Closed
rivo opened this issue Mar 14, 2019 · 4 comments
Closed

tcell breaks grapheme clusters #264

rivo opened this issue Mar 14, 2019 · 4 comments

Comments

@rivo
Copy link
Contributor

rivo commented Mar 14, 2019

The following code crashes tcell:

package main

import "github.com/gdamore/tcell"

func main() {
	text := []rune("🇩🇪")
	screen, _ := tcell.NewScreen()
	screen.Init()
	screen.SetContent(0, 0, text[0], text[1:], tcell.StyleDefault.Foreground(tcell.ColorWhite))
	screen.Sync()
	screen.PollEvent()
	screen.Fini()
}

Panic trace:

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/gdamore/tcell.(*CellBuffer).SetContent(0xc000094360, 0x0, 0x0, 0x1f1e9, 0xc0000b2774, 0x1, 0x1, 0x400000f00000000)
/github.com/gdamore/tcell/cell.go:61 +0x384
github.com/gdamore/tcell.(*tScreen).SetContent(0xc000094340, 0x0, 0x0, 0x1f1e9, 0xc0000b2774, 0x1, 0x1, 0x400000f00000000)
/github.com/gdamore/tcell/tscreen.go:435 +0xba
main.main()
/test.go:9 +0xbc
exit status 2

I've spent the last few days on this topic and realized we (tview, tcell, and go-runewidth) have all been rolling our own Unicode code point combination/splitting. Mostly by handling Modifiers (e.g. "a" (0x61) + "◌̈" (0x308) = "ä") and recently (#233) adding some Zero-Width Joiner (ZWJ) support.

It turns out that the topic is much more sophisticated and Unicode defines specific rules on what constitutes a character (the so-called "grapheme cluster"). They can be found in Unicode Standard Annex 29, specifically Section 3.1.1. (Modifiers and ZWJs are just one of the 14 rules).

The document also describes how to deal with flags, or "regional indicator symbols", as in the code example above.

I've published a new package https://github.com/rivo/uniseg which implements these rules so we can now split any string into its characters according to Annex 29.

Regarding tcell and the issue above, I actually think it doesn't need to handle any of this. It should be up to the caller to determine what is a character and pass those code points on to tcell. The SetContent() function should, in my opinion, not drop any runes from the combining characters slice. (And runewidth.RuneWidth(r) != 0 is not a good indicator for "invalid code points" anyway, as we saw in the panic above.) If I want to print "🏳️‍🌈", I send its code points 0x1f3f3 0xfe0f 0x200d 0x1f308 to tcell and those should be written to the terminal, no need to handle the Modifier or ZWJ in any special way. (And I know the macOS terminal, and likely others, too, will render the flag correctly.)

So my request would be to remove lines 52-65 from cell.go. I'm not sure if "combining characters" are dealt with specifically in other places of tcell. If so, those should also be reviewed, I think.

Please let me know what you think about this.

@gdamore
Copy link
Owner

gdamore commented Mar 15, 2019

So the challenge is that tcell needs to be able to take an array of runes, and determine where the boundary for the grapheme is, and also the width in display cells of that grapheme. (Frankly, Unicode is making this all nuts, by going crazy adding new encodings and new combining schemes. I mostly blame emoji although some natural languages do make this complex.

I believe that what should happen here is that the application should supply graphemes, with the
first rune as the primary character, and any additional runes in the combc. If I had to do this over again I'd probably use a variadic interface here instead.

Now even given those, I need to determine whether the grapheme occupies one cell, or two, or (in some cases) none. This lets me figure out what I'm actually going to emit, and determine how many columns are used.

If you could figure out how to combine the width calculation with the grapheme bits (e.g. a Graphemes.DisplayWidth() API), that would be really really helpful, I think. I suspect you could probably add that to the table property you already have.

DisplayWidth needs to cope with the fact that some character widths are "ambiguous", so probably it should return values for "narrow", "wide", and "ambiguous". Also "none" for zero width runes like control characters.

@gdamore
Copy link
Owner

gdamore commented Mar 15, 2019

As I think more about this and review the code, I think you're right. The logic there was intended to strip out bogus characters (non-combining) as those would normally be application errors, but I think this was a mistake in retrospect.

I suspect tcell doesn't even need to segment at all... but the views package (which displays strings) would need this via your uniseg package. We can probably get away with using the first rune to determine the width.

@gdamore
Copy link
Owner

gdamore commented Mar 15, 2019

See this for details on width: http://www.unicode.org/reports/tr11/

@gdamore gdamore changed the title tcell fails on regional flags (or: DIY Unicode grapheme cluster breaking) tcell breaks grapheme clusters Mar 19, 2019
@rivo
Copy link
Contributor Author

rivo commented Mar 19, 2019

Thanks a lot!

Regarding the width of grapheme clusters (see also mattn/go-runewidth#28), I think what we want instead of (or in addition to) runewidth.RuneWidth() is something like runewidth.GraphemeClusterWidth() or make runewidth.StringWidth() use my new package. I don't know what the correct logic for this would be. You suggested using the first rune but there are cases where runewidth.RuneWidth() will return 0 for the first character:

fmt.Printf("%d\n", runewidth.RuneWidth(0x070f)) // Outputs "0".

(0x070f falls into the "Prepend" category which is allowed as a first character.)

My best guess at this point is to use the width of the first non-zero-width rune for the entire grapheme cluster. I could be wrong but I guess I'll try it for now in tview as runewidth.StringWidth() definitely returns the wrong widths at the moment (from what I can see, it only adds up individual runes' widths).

I don't know how much time @mattn has at the moment to work on this. I've added my own implementation to tview, you can find it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants