Width is 1 when it should be 2 #59
Comments
I don't think I've ever used StringWidth. I was introduced to this library by way of tcell and its demos. You can see here that another strategy is to use RuneWidth in a loop. Incidentally, that code looks similar to the implementation of StringWidth, but without a lot of (probably important) edge case handling. |
Yeah, we've been through this before. Just looking at individual runes is not enough to determine the width of a string. Characters can consist of multiple runes ("🏳️🌈" is composed of 4 runes and 14 bytes). I guess I'll have to dig into |
runewidth can handle East Asian characters, but emojis and flags are very different and should be handled differently. https://unicode.org/emoji/charts/emoji-list.html |
It seems that |
|
What does this mean for the issue at hand? Can you elaborate? |
If the field is |
I don't know what you mean by that. As I mentioned above, it currently returns So are you saying we should add up the rune widths of all runes this character is composed of? Then how do we deal with emojis? (I know I keep repeating myself here but maybe I'm not expressing myself clearly enough.) |
Ah, sorry. I misread. I didn't understand |
Here is my fix. It's not perfect, but it works for me, tested with combining characters, Chinese, emoji, and flags.

// In the loop, a national flag's width comes out as 1+1=2.
func runesWidth(runes []rune) (width int) {
	// quick pass for iso8859-1
	if len(runes) == 1 && runes[0] < 0x00fe {
		return 1
	}
	cond := runewidth.NewCondition()
	cond.StrictEmojiNeutral = false
	cond.EastAsianWidth = true
	// loop for multi rune
	width = 0
	for i := 0; i < len(runes); i++ {
		width += cond.RuneWidth(runes[i])
	}
	return width
}

There is a precondition: You must use

graphemes := uniseg.NewGraphemes(str)
for graphemes.Next() {
	input = graphemes.Runes()
} |
I just tried this and it gives me: StringWidth("🏳️🌈") == 6 where it should be 2. I think we're all underestimating this topic. If there's one thing I learned, it's that Unicode handling is complicated. Especially for emojis, there are various ways of constructing them, and simply adding up code point widths (or using the width of the first code point, like I suggested earlier) will not work. I guess I have to do the grunt work and get into all the details. |
Ok, so I went down the rabbit hole of calculating the monospace width of strings, like this package attempts. Unfortunately, the whole premise of doing this code point by code point (or, in Golang terms, rune by rune) will never lead to satisfactory results.

It works ok for non-English characters, e.g. 世界 or खा. You can simply add up the width of all individual runes. But that completely fails for emojis. If you look at Unicode Technical Standard #51, you'll find that emojis can be encoded in tons of different ways. Zero-Width Joiner is just one small part of it. There are modifiers that force text presentation (=width of 1), modifiers that force normal characters like the digit "1" into emoji presentation (=width of 2), there are regional indicators (flags), complex emoji sequences, just to name a few.

To handle these correctly, we need to look at grapheme clusters as a whole. We already attempted this in this package, but it requires a lot more analysis of these clusters than what's happening now.

I don't actually know how the original

By now, I've implemented my own version of

As for this package and this issue, I think there is value in being compatible with the original

For a quick fix, though, I would suggest the following:
This is not perfect but it will work for most use cases. If you want to do emoji handling in a better way, another quick fix would be:
We're already using If you want to get rid of |
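The point about looking at a grapheme cluster as a whole can be made concrete with a small sketch. It is not the code from go-runewidth, rivo/uniseg, or the PR, and it is not the quick fix proposed in the comment above. It only encodes three of the cases that comment names (text presentation selector, emoji presentation selector, regional indicator pairs) and otherwise falls back to the width of the first rune, which is the behaviour the issue describes. The names `clusterWidth`, `isRegionalIndicator`, and `narrow` are hypothetical, and the function assumes the caller has already split the string into grapheme clusters.

```go
package main

import "fmt"

const (
	textSelector  = 0xFE0E // variation selector-15: requests text presentation
	emojiSelector = 0xFE0F // variation selector-16: requests emoji presentation
)

// isRegionalIndicator reports whether r is one of the regional indicator
// symbols (U+1F1E6..U+1F1FF) that pair up to form flags.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// clusterWidth estimates the monospace width of ONE grapheme cluster.
// runeWidth is a caller-supplied per-rune width function; in real code it
// would come from a width table, here it is an assumption of the sketch.
func clusterWidth(cluster []rune, runeWidth func(rune) int) int {
	if len(cluster) == 0 {
		return 0
	}
	// A pair of regional indicators renders as a single flag.
	if len(cluster) >= 2 && isRegionalIndicator(cluster[0]) && isRegionalIndicator(cluster[1]) {
		return 2
	}
	// A presentation selector after the first rune decides the width.
	for _, r := range cluster[1:] {
		switch r {
		case emojiSelector:
			return 2
		case textSelector:
			return 1
		}
	}
	// Otherwise fall back to the width of the first rune.
	return runeWidth(cluster[0])
}

// narrow is a stand-in width function that treats every rune as width 1.
func narrow(r rune) int { return 1 }

func main() {
	fmt.Println("rainbow flag:", clusterWidth([]rune{0x1F3F3, 0xFE0F, 0x200D, 0x1F308}, narrow))
	fmt.Println("keycap 1:", clusterWidth([]rune{'1', 0xFE0F, 0x20E3}, narrow))
	fmt.Println("country flag:", clusterWidth([]rune{0x1F1E9, 0x1F1EA}, narrow))
	fmt.Println("text-style heart:", clusterWidth([]rune{0x2764, 0xFE0E}, narrow))
}
```

Even this small rule set never adds up rune widths inside a cluster. The cases it leaves out (skin tone modifiers, tag sequences, spacing marks) are the "lot more analysis" the comment refers to.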
… in StringWidth. See mattn#59
I updated PR #63 to reflect this. It's the "quick fix" solution I described above. |
glad to hear that. i will give it a try if i find some time. |
Thanks. I'll look into it later. |
@rivo It works perfectly. Thanks for your contribution! |
… to speedier version.
Adapted to short-notice change in rivo/uniseg.
Upgraded to latest rivo/uniseg. Also implemented basic emoji handling in StringWidth. See mattn#59
Added a test for Truncate that includes emojis.
Split the code so it can be upgraded once we move to Go1.18+
The wrong uniseg version was used. Fixed it.
I stumbled over a character that, when output to the console directly, takes up two characters. But StringWidth() gives me 1. This is because the first rune of this character has a width of 1 and that's what's being used, see here. I know I wrote this code and I'm sure that you cannot simply add up the widths of individual runes ("🏳️🌈" would then have a width of 4 which is obviously wrong) and using the first rune's width worked fine so far. But it turns out that it fails in some cases.

I'm not familiar with Indian characters but it seems to me that the second rune is a modifier that turns the character from a width of 1 into a width of 2. Are you aware of any logic that we could add to go-runewidth that makes this right?

Here's example code that illustrates the issue:
Output (on macOS with iTerm2):
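The observation that the second rune acts as a modifier can be examined with the standard library. This is a minimal sketch and an assumption on two counts: it uses the string खा, which appears in a comment in this thread and is not necessarily the character from the original report, and `isSpacingMark` is a hypothetical helper. It shows that the second code point belongs to Unicode category Mc (spacing combining mark), meaning it attaches to the preceding consonant but still occupies space of its own.

```go
package main

import (
	"fmt"
	"unicode"
)

// isSpacingMark (hypothetical helper) reports whether r is in Unicode
// category Mc, i.e. a combining mark that takes up horizontal space.
func isSpacingMark(r rune) bool {
	return unicode.Is(unicode.Mc, r)
}

func main() {
	// U+0916 DEVANAGARI LETTER KHA followed by U+093E DEVANAGARI VOWEL SIGN AA.
	const kha = "\u0916\u093E"

	for _, r := range kha {
		fmt.Printf("%U spacing mark: %t\n", r, isSpacingMark(r))
	}
}
```

A rule that takes only the first rune's width ignores a spacing mark like this one, which is consistent with a result of 1 for a character that displays two cells wide.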