Regional Indicators (Flags) and Grapheme Clusters #28

Closed
rivo opened this issue Mar 14, 2019 · 13 comments

@rivo

rivo commented Mar 14, 2019

Here's a short example that illustrates an issue with flags (or "regional indicators"):

fmt.Println(runewidth.StringWidth("🇩🇪")) // Should be "2", outputs "4".

The flag consists of two code points which are processed separately by runewidth. But most modern systems will combine them into one flag emoji.
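To make the two code points visible, here is a minimal sketch using only the Go standard library. The helper names `codePoints` and `isRegionalIndicator` are made up for illustration and are not part of runewidth; the range U+1F1E6..U+1F1FF is the Regional Indicator Symbol block.

```go
package main

import "fmt"

// codePoints returns the individual Unicode code points of s.
// (Hypothetical helper, for illustration only.)
func codePoints(s string) []rune {
	return []rune(s)
}

// isRegionalIndicator reports whether r lies in the Regional Indicator
// Symbol range U+1F1E6..U+1F1FF. (Hypothetical helper.)
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

func main() {
	// The German flag, written with escapes: regional indicators D and E.
	flag := "\U0001F1E9\U0001F1EA"
	for _, r := range codePoints(flag) {
		fmt.Printf("U+%04X regional indicator: %v\n", r, isRegionalIndicator(r))
	}
}
```

A width function that sums per-rune widths sees two separate symbols here, while a terminal that pairs regional indicators draws a single flag.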

This is part of a larger topic which I describe in more detail here: gdamore/tcell#264. It doesn't just affect flags but also characters in e.g. Arabic and Korean where there are more sophisticated rules than "combining characters" and zero-width joiners (which you added with #20).

I don't know exactly how you calculate the widths of characters. I'm also not sure how you would solve flags as well as some of the other rules described in the Unicode specification but it would sure be nice as printing these flags currently gives me trouble in tview. There have been multiple issues asking for better support for different languages and emojis so it seems that there are quite a few people who use the terminal with these characters.

(Maybe my new package uniseg can help you here.)

@rivo
Author

rivo commented Mar 19, 2019

Here's my own implementation of the "string width" function which takes grapheme clusters into account:

https://github.com/rivo/tview/blob/8d5eba0c2f51d8ae971c5a470e354bbc2aae6777/util.go#L419

It's based on the assumption that the width of a grapheme cluster is the width of the first non-zero-width rune. That's just my guess but it works fine for a bunch of examples I tried manually.
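The heuristic can be sketched roughly as follows. This is not the linked tview code: the clusters are passed in already segmented, and `toyWidth` is a made-up stand-in for a real per-rune width function, so everything named here is an assumption for illustration.

```go
package main

import "fmt"

// clusterWidth returns the width of the first rune in the cluster whose
// width is non-zero, or 0 if every rune in the cluster has zero width.
func clusterWidth(cluster []rune, runeWidth func(rune) int) int {
	for _, r := range cluster {
		if w := runeWidth(r); w > 0 {
			return w
		}
	}
	return 0
}

// stringWidth sums the widths of pre-segmented grapheme clusters.
func stringWidth(clusters [][]rune, runeWidth func(rune) int) int {
	total := 0
	for _, c := range clusters {
		total += clusterWidth(c, runeWidth)
	}
	return total
}

// toyWidth is a hypothetical width table, NOT a real Unicode width lookup:
// zero-width joiner and variation selectors are 0, regional indicators are 2,
// everything else is 1.
func toyWidth(r rune) int {
	switch {
	case r == 0x200D || (r >= 0xFE00 && r <= 0xFE0F):
		return 0
	case r >= 0x1F1E6 && r <= 0x1F1FF:
		return 2
	default:
		return 1
	}
}

func main() {
	flag := []rune{0x1F1E9, 0x1F1EA} // one cluster: the German flag
	clusters := [][]rune{flag, {'a'}}
	fmt.Println(stringWidth(clusters, toyWidth)) // prints 3 (2 for the flag + 1 for 'a')
}
```

In real use the clusters would presumably come from a segmenter such as uniseg and the per-rune width from runewidth; both are left out here to keep the sketch self-contained.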

Maybe you want to use this implementation in your package. I think it would definitely improve the string width calculation. You could then also get rid of the special zero-width-joiner handling, as it's all implicit in the uniseg package.

@mattn
Owner

mattn commented Mar 24, 2019

Could you please send me a PR?

@alecrabbit

Hi! I'm not sure whether this issue is related, but I assume it is.
My terminal renders the characters {"←", "↖", "↑", "↗", "→", "↘", "↓", "↙"} with width 1 and everything works as it should. However, runewidth.StringWidth(char) returns [1 2 1 2 1 2 1 2] respectively, and that breaks the output.

// Character StringWidth uniseg.Graphemes
    ←         1           [2190]
    ↖         2           [2196]
    ↑         1           [2191]
    ↗         2           [2197]
    →         1           [2192]
    ↘         2           [2198]
    ↓         1           [2193]
    ↙         2           [2199]

The same happens for:

// Character StringWidth uniseg.Graphemes
    ■         1           [25a0]
    □         1           [25a1]
    ▪         2           [25aa]
    ▫         2           [25ab]

I hope this additional info will help.

My PHP package php-wcwidth (which is practically a dumb clone of Python's jquast/wcwidth) gets the widths of these characters right.

@mattn
Owner

mattn commented Sep 20, 2019

Thank you. Could you please show me a screenshot?

This is a screenshot taken in my environment.

image

@alecrabbit

this one?
image

@alecrabbit

same but larger
image

@alecrabbit

and from terminal
image

@mattn
Owner

mattn commented Sep 20, 2019

What is your $LANG?

@alecrabbit

LANG=en_US.UTF-8

@mattn
Owner

mattn commented Sep 20, 2019

@joshuarubin Is it correct that 0x2194 is treated as an emoji?

@alecrabbit

@mattn here's what I found out:
These do not have emoji forms:
image

but these do:
image

@alecrabbit

And my terminal can print both of them:
image
However, I'm unable to figure out how to print it from my code.
Printing from code gives [image]; copy-pasting also gives [image].

@alecrabbit

alecrabbit commented Sep 20, 2019

It seems that 2194 has to be followed by fe0f to be printed as an emoji,
so: 2194 fe0f

UPD
DerivedGeneralCategory.txt:

FE00..FE0F    ; Mn #  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16

@mattn mattn closed this as completed in d3f4cc2 Jan 11, 2021
mattn added a commit that referenced this issue Jan 11, 2021
Fixed StringWidth() implementation by using proper Unicode grapheme cluster segmentation. Fixes #28