Interpreting grapheme clusters when calculating width breaks on most terminal emulators #826

Open
qsantos opened this issue Nov 16, 2024 · 6 comments

Comments

@qsantos

qsantos commented Nov 16, 2024

tl;dr: calculate_position should not use the lengths of graphemes as provided by unicode-width, but instead use the sum of the widths of the codepoints.

At least on Unix, when calculating the width of displayed characters, rustyline uses grapheme segmentation.

However, using the minimal example and pasting 👨‍👩‍👧‍👦 (\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}), and then typing A, results in the following output:
image

This is because my terminal does not interpret the ZERO WIDTH JOINER (U+200D). In fact, I was able to reproduce this behavior in the following terminal emulators:

  • xfce4-terminal
  • gnome-terminal
  • rxvt-unicode
  • mate-terminal
  • blackbox-terminal
  • Putty
  • Kitty
  • Mac's Terminal
  • iTerm2
  • VS Code's built-in terminal
  • Intellij IDEA's terminal

Edit: Regarding the sentence below, UAX #11 actually says nothing about graphemes. It mostly talks about CJK characters and half-width variants, which do not require grapheme handling either. In fact, UTS #51 says that the handling of the ZERO WIDTH JOINER can vary by platform. So what we are seeing is a choice made by unicode-width. rustyline might not want to follow it, and could instead use the sum of the widths of the individual code points.

Unicode does say that the full grapheme should be considered, and unicode-width implements it accordingly:

use unicode_width::{UnicodeWidthChar, UnicodeWidthStr};

fn main() {
    let s = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    println!("{} {s}", s.width());
    for c in s.chars() {
        println!("{} {c}", c.width().unwrap_or(9));
    }
}

outputs:

2 👨‍👩‍👧‍👦
2 👨
0 ‍
2 👩
0 ‍
2 👧
0 ‍
2 👦

The first line looks correct in a graphical browser, but this is what I actually see:

image

Also note that this is not about legacy vs extended graphemes. ZERO WIDTH JOINER is considered in both.

If I remove the .graphemes(true) part from calculate_position (and adapt the code to use codepoints instead of grapheme clusters), I achieve the expected behavior:

image
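
For illustration, here is a minimal sketch of the codepoint-based calculation, using only the standard library. `approx_char_width` is a hypothetical, deliberately crude stand-in for `UnicodeWidthChar::width` (it only knows about the ZWJ, variation selectors and one emoji range), and `codepoint_sum_width` is not rustyline's actual `calculate_position`:

```rust
/// Hypothetical, crude stand-in for `UnicodeWidthChar::width`.
/// Assumption: only the cases needed for this issue are covered.
fn approx_char_width(c: char) -> usize {
    match c {
        // ZERO WIDTH JOINER and variation selectors occupy no column.
        '\u{200d}' | '\u{fe00}'..='\u{fe0f}' => 0,
        // Treat this pictographic emoji range as two columns wide.
        '\u{1f300}'..='\u{1faff}' => 2,
        // Everything else is assumed to be one column wide.
        _ => 1,
    }
}

/// Width as seen by a terminal that does not interpret ZWJ sequences:
/// the sum of the widths of the individual code points.
fn codepoint_sum_width(s: &str) -> usize {
    s.chars().map(approx_char_width).sum()
}
```

With this sketch, the family emoji from the top of the issue measures 8 columns (four emoji of width 2, three joiners of width 0), which matches what the terminals listed above actually draw.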

Are there cases where we do need to use grapheme clusters when calculating widths? That is, either:

  • Common terminal emulators that follow the Unicode specification more closely.
  • Other grapheme clusters whose visual width, in common terminal emulators, differs from the sum of the widths of their codepoints taken separately.
@qsantos qsantos changed the title Interpreting graphene clusters when calculating width breaks on most terminal emulators Interpreting grapheme clusters when calculating width breaks on most terminal emulators Nov 16, 2024
@gwenn
Collaborator

gwenn commented Nov 16, 2024

See #184

@qsantos
Author

qsantos commented Nov 16, 2024

Thanks, I should have searched for “width” instead of “grapheme”.

@gwenn
Collaborator

gwenn commented Nov 17, 2024

Cannot reproduce with iTerm2 or WezTerm:
image
But with kitty:
image
And Mac terminal:
image
And Alacritty:
image

@qsantos
Author

qsantos commented Nov 18, 2024

I tried again with iTerm2, making sure to update to the latest version (3.5.9). The behavior is actually a bit different from other terminals. In my original post, I had only checked that my iTerm2 did not interpret ZWJs when pasting in Zsh/Python3. But the ZWJs are actually taken into account when running in Rustyline! At least for the version without any skin color variations:

iTerm2 Screenshot

The first emoji seems to come from Apple giving up on ZWJ and using a generic image. However, the version with skin color variation is still rendered as individual emojis. Could you check how it looks for you with, e.g., 👩🏼‍👨🏼‍👦🏼‍👦🏼?

In any case, interpreting that first one as having a width of 8 would definitely break things there. One option would be to check for LC_TERMINAL=iTerm2 and adapt the behavior, but that sounds brittle. Another, more robust option might be to actually filter ZWJs out when displaying.
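
To make the two options concrete, here is a rough sketch using only the standard library. Both function names are hypothetical and nothing here is existing rustyline code. Note that the ZWJ is also used outside of emoji (for instance to control joining in some scripts), so stripping it unconditionally may change how non-emoji text is rendered:

```rust
const ZWJ: char = '\u{200d}';

/// Option 1 (brittle): decide from the value of `LC_TERMINAL`.
/// The caller passes the variable's value so the check stays easy to test.
fn is_iterm2(lc_terminal: Option<&str>) -> bool {
    lc_terminal == Some("iTerm2")
}

/// Option 2: drop every ZERO WIDTH JOINER before writing to the terminal,
/// so that only standalone code points are displayed.
fn strip_zwj(s: &str) -> String {
    s.chars().filter(|&c| c != ZWJ).collect()
}
```

The first function could be called as `is_iterm2(std::env::var("LC_TERMINAL").ok().as_deref())`. With the second, the displayed text and the codepoint-based width agree on every terminal, at the cost of never showing the combined glyph.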

@gwenn
Collaborator

gwenn commented Nov 23, 2024

There seem to be three modes:

  • grapheme width (like rustyline)
  • codepoint width (like you suggested)
  • and no zero-width joiner
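
A hedged sketch of how these three modes could be expressed as a setting. `WidthMode`, `display_width` and `approx_char_width` are hypothetical names, the width table is a crude stand-in for unicode-width, and the grapheme mode is only approximated here by ignoring any code point that directly follows a ZWJ (real grapheme segmentation is more involved):

```rust
const ZWJ: char = '\u{200d}';

#[derive(Clone, Copy)]
enum WidthMode {
    /// A ZWJ sequence is measured as a single cluster (like rustyline today).
    Grapheme,
    /// Every code point is measured on its own and the widths are summed.
    CodePoint,
    /// ZWJs are not written at all; the remaining code points are summed.
    NoZwj,
}

/// Hypothetical, crude stand-in for `UnicodeWidthChar::width`.
fn approx_char_width(c: char) -> usize {
    match c {
        '\u{200d}' | '\u{fe00}'..='\u{fe0f}' => 0,
        '\u{1f300}'..='\u{1faff}' => 2,
        _ => 1,
    }
}

fn display_width(s: &str, mode: WidthMode) -> usize {
    match mode {
        // The ZWJ has width 0, so both modes measure the same thing;
        // they differ only in whether the ZWJ is sent to the terminal.
        WidthMode::CodePoint | WidthMode::NoZwj => s.chars().map(approx_char_width).sum(),
        WidthMode::Grapheme => {
            let mut width = 0;
            let mut prev_was_zwj = false;
            for c in s.chars() {
                // Approximation: a code point joined to the previous one
                // by a ZWJ adds no extra columns.
                if !prev_was_zwj {
                    width += approx_char_width(c);
                }
                prev_was_zwj = c == ZWJ;
            }
            width
        }
    }
}
```

For the family emoji from the original post, this sketch gives 2 columns in `Grapheme` mode and 8 columns in the other two modes.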
