Width is 1 when it should be 2 #59

rivo · 2022-02-16T10:08:01Z

I stumbled over a character that, when output to the console directly, takes up two characters. But StringWidth() gives me 1. This is because the first rune of this character has a width of 1 and that's what's being used, see here. I know I wrote this code and I'm sure that you cannot simply add up the widths of individual runes ("🏳️‍🌈" would then have a width of 4 which is obviously wrong) and using the first rune's width worked fine so far. But it turns out that it fails in some cases.

I'm not familiar with Indian characters but it seems to me that the second rune is a modifier that turns the character from a width of 1 into a width of 2. Are you aware of any logic that we could add to go-runewidth that makes this right?

Here's example code that illustrates the issue:

package main

import (
	"fmt"

	runewidth "github.com/mattn/go-runewidth"
)

func main() {
	s := "खा"
	fmt.Println("0123456789")
	fmt.Println(s + "<")
	fmt.Printf("String width: %d\n", runewidth.StringWidth(s))
	var i int
	for _, r := range s {
		fmt.Printf("Rune %s  (%d) width: %d\n", string(r), i, runewidth.RuneWidth(r))
		i++
	}
}

Output (on macOS with iTerm2):

The text was updated successfully, but these errors were encountered:

dricottone · 2022-02-16T22:33:09Z

I don't think I've ever used StringWidth. I was introduced to this library by way of tcell and it's demos. You can see here that another strategy is to use RuneWidth in a loop.

And incidentally that code looks similar to the implementation of StringWidth, but without a lot of (probably important) edge case handling.

rivo · 2022-02-17T07:56:04Z

Yeah, we've been through this before. Just looking at individual runes is not enough to determine the width of a string. Characters can consist of multiple runes ("🏳️‍🌈" is composed of 4 runes and 14 bytes). StringWidth currently uses only the width of the first rune, which is a rule I implemented because I didn't know better. Now we know that this is not correct in some cases. But to be honest, I don't even know where to look for the correct rule. The width of characters on terminal screens is not part of any official specification.

I guess I'll have to dig into wcwidth.c which appears to be used by most terminals.

ericwq · 2022-05-21T01:46:49Z

runwidth can handle east asia character, but emoj/flag is very different. should be handled differently. https://unicode.org/emoji/charts/emoji-list.html

rivo · 2022-05-23T05:32:42Z

It seems that runewidth does not handle East Asian characters correctly, as you can see in the initial comment of this thread. I tried to look at other implementations of this but wasn't able to find a good answer as to how other packages solve this issue. The question remains, how these (and emoji) characters should be handled when composed of multiple runes.

mattn · 2022-05-23T05:54:54Z

खा 0x0916 is not Ambiguous (A). It is Neutral (N).

https://unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt

rivo · 2022-05-23T06:02:23Z

What does this mean for the issue at hand? Can you elaborate?

mattn · 2022-05-23T06:09:49Z

If the field is W, runewidth.StringWidth return 2. Table of runewidth is generated from EastAsianWidth.txt. So runewidith works correctly, I think.

rivo · 2022-05-23T06:16:49Z

If the field is W, runewidth.StringWidth return 2.

I don't know what you mean by that. As I mentioned above, it currently returns 1 for the character at hand.

So are you saying we should add up the rune widths of all runes this character is composed of? Then how do we deal with emojis?

(I know I keep repeating myself here but maybe I'm not expressing myself clear enough.)

mattn · 2022-05-23T06:36:35Z

Ah, sorry. I did misreading. I didn't understand खा have two runes.

ericwq · 2022-05-23T13:07:08Z

Here is my fix. It's not perfect, but it works for me, tested with combining character, chinese, emoji, flags.

// In the loop, national flag's width got 1+1=2.
func runesWidth(runes []rune) (width int) {
	// quick pass for iso8859-1
	if len(runes) == 1 && runes[0] < 0x00fe {
		return 1
	}

	cond := runewidth.NewCondition()
	cond.StrictEmojiNeutral = false
	cond.EastAsianWidth = true

	// loop for multi rune
	width = 0
	for i := 0; i < len(runes); i++ {
		width += cond.RuneWidth(runes[i])
	}

	return width
}

There is a precondition: You must use uniseg to split it into multi runes.

graphemes := uniseg.NewGraphemes(str)
for graphemes.Next() {
	input = graphemes.Runes()
}

rivo · 2022-07-27T11:14:14Z

I just tried this and it gives me:

StringWidth("🏳️‍🌈") == 6

where it should be 2.

I think we're all underestimating this topic. If there's one thing I learned, it's that Unicode handling is complicated. Especially for emojis, there are various ways of constructing them, and simply adding up code point widths (or using the width of the first code point, like I suggested earlier) will not work.

I guess I have to do the grunt work and get into all the details.

rivo · 2022-09-11T09:17:13Z

Ok, so I went down the rabbit hole of calculating the monospace width of strings, like this package attempts. Unfortunately, the whole premise of doing this code point by code point (or, in Golang terms, rune by rune) will never lead to satisfactory results. It works ok for non-English Characters, e.g. 世界 or खा. You can simply add up the width of all individual runes.

But that completely fails for emojis. If you look at Unicode Technical Standard #51, you'll find that emojis can be encoded in tons of different ways. Zero-Width Joiner is just one small part of it. There are modifiers that force text presentation (=width of 1), modifiers that force normal characters like the digit "1" into emoji presentation (=width of 2), there are regional indicators (flags), complex emoji sequences, just to name a few.

To handle these correctly, we need to look at grapheme clusters as a whole. We already attempted this in this package, but it requires a lot more analysis of these clusters than what's happening now. I don't actually know how the original wcwidth handles emojis but similar packages like the Ruby version, janlelis/unicode-display_width, makes that extension optional, i.e. there is special handling for emojis. I also found that many terminals will render some emojis with the wrong width. So I guess wcwidth (presumably used by most of them) still has some flaws.

By now, I've implemented my own version of wcwidth in rivo/uniseg. uniseg already has all the tools in place for grapheme cluster parsing, code point classification etc. You can check out the algorithm here.

As for this package and this issue, I think there is value in being compatible with the original wcwidth functionality, as this will allow us to write Go programs that work on most terminals (assuming most terminals use wcwidth). If someone has the time, please research how wcwidth handles emojis. In my opinion, this would be the optimal solution.

For a quick fix, though, I would suggest the following:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), always return a total width of 2.

This is not perfect but it will work for most use cases.

If you want to do emoji handling in a better way, another quick fix would be:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), use the width returned by rivo/uniseg.

We're already using rivo/uniseg to determine the grapheme clusters so this will come for free. (This will still not handle conjoining Hangul characters correctly but maybe that's ok.)

If you want to get rid of rivo/uniseg again, sure, you can go back to adding up all rune widths again like it was in the beginning of this package. But many emojis will have the wrong width then. It's not something I would suggest but this is not my package, I'm not the one to decide.

… in StringWidth. See mattn#59

rivo · 2022-09-11T11:20:53Z

I updated PR #63 to reflect this. It's the "quick fix" solution I described above.

ericwq · 2022-09-11T11:21:37Z

glad to hear that. i will give it a try if i find some time.

mattn · 2022-09-11T16:35:55Z

Thanks. I'll look into it in later.

ericwq · 2022-09-29T07:31:39Z

@rivo It works perfectly. Thanks for your contribution!

… to speedier version. Adapted to short-notice change in rivo/uniseg. Upgraded to latest rivo/uniseg. Also implemented basic emoji handling in StringWidth. See mattn#59 Added a test for Truncate that includes emojis. Split the code so it can be upgraded once we move to Go1.18+ The wrong uniseg version was used. Fixed it.

This was referenced Feb 16, 2022

Unwanted artifacts on screen when scrolling text with a lot of unicode characters rivo/tview#693

Closed

Shifting characters gdamore/tcell#515

Closed

rivo added a commit to rivo/go-runewidth that referenced this issue Sep 11, 2022

Upgraded to latest rivo/uniseg. Also implemented basic emoji handling…

2e820a2

… in StringWidth. See mattn#59

rivo mentioned this issue Sep 11, 2022

Upgraded rivo/uniseg to latest version, switched StringWidth/Truncate to speedier version #63

Open

junegunn mentioned this issue Jan 4, 2024

Lines not flushed correctly in presence of emoji junegunn/fzf#3558

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Width is 1 when it should be 2 #59

Width is 1 when it should be 2 #59

rivo commented Feb 16, 2022

dricottone commented Feb 16, 2022

rivo commented Feb 17, 2022

ericwq commented May 21, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022 •

edited

Loading

ericwq commented May 23, 2022

rivo commented Jul 27, 2022

rivo commented Sep 11, 2022

rivo commented Sep 11, 2022

ericwq commented Sep 11, 2022

mattn commented Sep 11, 2022

ericwq commented Sep 29, 2022

Width is 1 when it should be 2 #59

Width is 1 when it should be 2 #59

Comments

rivo commented Feb 16, 2022

dricottone commented Feb 16, 2022

rivo commented Feb 17, 2022

ericwq commented May 21, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022

rivo commented May 23, 2022

mattn commented May 23, 2022 • edited Loading

ericwq commented May 23, 2022

rivo commented Jul 27, 2022

rivo commented Sep 11, 2022

rivo commented Sep 11, 2022

ericwq commented Sep 11, 2022

mattn commented Sep 11, 2022

ericwq commented Sep 29, 2022

mattn commented May 23, 2022 •

edited

Loading