Interpreting grapheme clusters when calculating width breaks on most terminal emulators #826

qsantos · 2024-11-16T18:04:23Z

tl;dr: calculate_position should not use the widths of grapheme clusters as reported by unicode-width, but instead the sum of the widths of their codepoints.

At least on Unix, when calculating the width of displayed characters, rustyline uses grapheme segmentation.

However, using the minimal example and pasting 👨‍👩‍👧‍👦 (\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}), and then typing A, results in the following output:

This is because my terminal does not interpret the ZERO WIDTH JOINER (U+200D). In fact, I was able to reproduce this behavior in the following terminal emulators:

xfce4-terminal
gnome-terminal
rxvt-unicode
mate-terminal
blackbox-terminal
Putty
Kitty
Mac's Terminal
iTerm2
VS Code's built-in terminal
Intellij IDEA's terminal

Edit: Regarding the sentence below, UAX #11 actually says nothing about graphemes. It mostly talks about CJK characters and half-width variants, which do not require grapheme handling either. In fact, UTS #51 says that the handling of the ZERO WIDTH JOINER can vary by platform. So what we are seeing is a choice made by unicode-width. rustyline might not want to follow it, and could use the sum of the widths of the individual code points instead.

~~Unicode does say that the full grapheme should be considered, and~~ unicode-width implements it that way:

use unicode_width::{UnicodeWidthChar, UnicodeWidthStr};

fn main() {
    let s = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    println!("{} {s}", s.width());
    for c in s.chars() {
        println!("{} {c}", c.width().unwrap_or(9));
    }
}

outputs:

2 👨‍👩‍👧‍👦
2 👨
0 ‍
2 👩
0 ‍
2 👧
0 ‍
2 👦

The first line looks correct in a graphical browser, but this is what I actually see:

Also note that this is not about legacy vs extended graphemes. ZERO WIDTH JOINER is considered in both.

If I remove the .graphemes(true) part from calculate_position (and adapt the code to use codepoints instead of grapheme clusters), I achieve the expected behavior:

Are there cases where we do need to use grapheme clusters when calculating widths? That is, either:

Common terminal emulators that follow the Unicode specification more closely.
Other grapheme clusters whose visual width, in common terminal emulators, differs from the sum of the widths of their codepoints taken separately.
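For reference, here is a minimal sketch of the "sum of codepoint widths" idea. It is not rustyline's actual calculate_position: `approx_char_width` and `codepoint_width_sum` are hypothetical names, and the width table is a deliberately crude stand-in for `UnicodeWidthChar::width` so that the snippet runs without any crate.

```rust
/// Rough per-codepoint width, for illustration only. A real implementation
/// would call `UnicodeWidthChar::width` from the unicode-width crate instead.
fn approx_char_width(c: char) -> usize {
    match c as u32 {
        // ZERO WIDTH JOINER and variation selectors occupy no cell.
        0x200D | 0xFE00..=0xFE0F => 0,
        // Main emoji blocks: assumed to be two cells wide.
        0x1F300..=0x1FAFF => 2,
        // Everything else is treated as one cell (an oversimplification).
        _ => 1,
    }
}

/// Width of `s` as seen by a terminal that does not join ZWJ sequences:
/// every codepoint advances the cursor by its own width.
fn codepoint_width_sum(s: &str) -> usize {
    s.chars().map(approx_char_width).sum()
}

fn main() {
    let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    // 2 + 0 + 2 + 0 + 2 + 0 + 2 = 8 cells, not the 2 reported for the cluster.
    println!("{}", codepoint_width_sum(family));
}
```

On a terminal that draws the four emoji separately, 8 is the number of cells the cursor actually moves, which is why typing A after the cluster ends up in the wrong column when the cluster is counted as 2.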

gwenn · 2024-11-16T19:58:04Z

See #184

qsantos · 2024-11-16T20:00:46Z

Thanks, I should have searched for “width” instead of “grapheme”.

gwenn · 2024-11-17T13:53:54Z

Cannot reproduce with iTerm2 or WezTerm:

But with kitty:

And Mac terminal:

And Alacritty:

qsantos · 2024-11-18T18:35:50Z

I tried again with iTerm2, making sure to update to the latest version (3.5.9). The behavior is actually a bit different from other terminals. In my original post, I had only checked that my iTerm2 did not interpret ZWJs when pasting in Zsh/Python3. But the ZWJs are actually taken into account when running in rustyline! At least for the version without any skin color variations:

The first emoji seems to come from Apple giving up on ZWJ and using a generic image. However, the version with skin color variation is still rendered as individual emojis. Could you check how it looks for you with, e.g. 👩🏼‍👨🏼‍👦🏼‍👦🏼?

In any case, interpreting that first one as having a width of 8 would definitely break things there. One option would be to check for LC_TERMINAL=iTerm2 and adapt the behavior, but that sounds brittle. Another, more robust option might be to actually filter ZWJs out when displaying.
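A rough sketch of what filtering on display could look like, using only the standard library. `strip_zwj` is a hypothetical helper, not an existing rustyline function, and it assumes that only the rendered copy is filtered while the edit buffer keeps the original text.

```rust
/// ZERO WIDTH JOINER (U+200D).
const ZWJ: char = '\u{200d}';

/// Returns a copy of `s` with every ZWJ removed, intended only for the
/// string written to the terminal. The edit buffer itself stays untouched.
fn strip_zwj(s: &str) -> String {
    s.chars().filter(|&c| c != ZWJ).collect()
}

fn main() {
    let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}";
    let shown = strip_zwj(family);
    // The 7 codepoints become 4 standalone emoji once the 3 joiners are gone.
    println!(
        "{} codepoints before, {} after",
        family.chars().count(),
        shown.chars().count()
    );
}
```

With the joiners gone, every terminal should draw four separate emoji, so the displayed width matches the sum of the codepoint widths. The cost is that ZWJ is also used outside emoji (for example in some Indic scripts), so filtering it unconditionally would change how such text is rendered.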

gwenn · 2024-11-23T09:52:22Z

Maybe we can check whether the current terminal supports emoji with this:
https://github.com/rockorager/libvaxis/blob/62854672ef2c7e70ff73056aa789db44efa69442/src/ctlseqs.zig#L9
wez/wezterm#4320
kovidgoyal/kitty#7799
https://mitchellh.com/writing/grapheme-clusters-in-terminals
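As far as I can tell, these links are about private mode 2027 (grapheme clustering), which a terminal can be asked about with a DECRQM query. Assuming that is the mechanism meant here, below is a sketch of the query string and of parsing the DECRPM reply. `QUERY_MODE_2027`, `ModeStatus` and `parse_decrpm` are hypothetical names, and actually sending the query and reading the reply from the tty in raw mode is left out.

```rust
/// DECRQM query for private mode 2027: CSI ? 2027 $ p
/// (assumption: this is the mode the links above refer to).
const QUERY_MODE_2027: &str = "\x1b[?2027$p";

/// Status value carried by a DECRPM reply.
#[derive(Debug, PartialEq)]
enum ModeStatus {
    NotRecognized,
    Set,
    Reset,
    PermanentlySet,
    PermanentlyReset,
}

/// Parses a DECRPM reply of the form `CSI ? <mode> ; <status> $ y`.
/// Returns `None` for anything that does not have exactly that shape.
fn parse_decrpm(reply: &str) -> Option<(u16, ModeStatus)> {
    let body = reply.strip_prefix("\x1b[?")?.strip_suffix("$y")?;
    let (mode, status) = body.split_once(';')?;
    let mode: u16 = mode.parse().ok()?;
    let status = match status {
        "0" => ModeStatus::NotRecognized,
        "1" => ModeStatus::Set,
        "2" => ModeStatus::Reset,
        "3" => ModeStatus::PermanentlySet,
        "4" => ModeStatus::PermanentlyReset,
        _ => return None,
    };
    Some((mode, status))
}

fn main() {
    // Show the query with its escape characters made visible.
    println!("query: {:?}", QUERY_MODE_2027);
    // Example reply from a terminal that knows the mode but has it turned off.
    println!("{:?}", parse_decrpm("\x1b[?2027;2$y"));
}
```

A terminal that does not answer at all, or answers with status 0, would presumably be treated as not clustering graphemes, so a timeout on the read would be needed in practice.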

gwenn · 2024-11-23T14:52:28Z

There seem to be three modes:

grapheme width (like rustyline)
codepoint width (like you suggested)
and no zero-width joiner

qsantos changed the title ~~Interpreting graphene clusters when calculating width breaks on most terminal emulators~~ Interpreting grapheme clusters when calculating width breaks on most terminal emulators Nov 16, 2024
