wrong width for U+00AD #8

Open
stevengj opened this issue Mar 11, 2015 · 25 comments

Comments

@stevengj

Hi, I was looking at your wcwidth library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaStrings/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give 1:

>>> from wcwidth import wcwidth
>>> wcwidth(unichr(173))
1

and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The wcwidth function on MacOS 10.10.2 also returns -1 (not printable) for this code point.

Am I calling your implementation incorrectly? This is for git master of wcwidth.
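
For readers who want to check the classification independently of either library, here is a minimal sketch that uses only Python's standard `unicodedata` module. It only confirms that U+00AD is in general category Cf; it says nothing about what width a terminal gives it. The helper name `is_format_char` is made up for this illustration.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD, decimal 173


def is_format_char(ch: str) -> bool:
    """Return True if ch is in Unicode general category Cf (format)."""
    return unicodedata.category(ch) == "Cf"


if __name__ == "__main__":
    # Print the character name and category, then the Cf check.
    print(unicodedata.name(SOFT_HYPHEN), unicodedata.category(SOFT_HYPHEN))
    print(is_format_char(SOFT_HYPHEN))
```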

@stevengj
Author

In case it is helpful, the draft table of character widths that we are currently planning to use can be found in this CharWidths.txt gist (each line of which is codepoints; width) where non-printing characters are assigned a width of 0. This is generated automatically from the unicode 7 tables combined with font metrics from GNU unifont, as described in JuliaStrings/utf8proc#27

@jquast
Owner

jquast commented Mar 11, 2015

Interesting. Which terminal on OSX are you using to test the "character cells consumed when printed"? I am also on OSX, and iTerm2 displays it as "a-b", consuming 3 cells, so wcwidth would be correct here. I would need to see evidence of the cursor not advancing a cell when it is printed on at least some terminal emulators, and would file bugs for the others. Just to be very clear, the purpose of wcwidth is "printable width on a terminal", not Firefox or anything else (in which such a character is hidden).

Also, I don't necessarily trust the OS-provided 'wcwidth'; these are typically based on very old (5-10 years old) Unicode specifications. I have a program that I've tested on OSX and Linux: the two are wildly different, and in each case my version was correct: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-libc-comparator.py

The combining and wide character tables are programmatically updated by "python setup.py update", which is similar to your https://github.com/JuliaLang/utf8proc/pull/27/files#diff-3832b9cfe2fc10d35ac5c63d9b7b8133R20

There are no Unicode specification reference tables for 0-width characters that I know of, so it's just hardcoded here: https://github.com/jquast/wcwidth/blob/master/wcwidth/wcwidth.py#L161-171

@jquast
Owner

jquast commented Mar 11, 2015

Using the 'Cf' category listings on iTerm2, it appears the following all consume 1 character cell, some with symbols, some simply by blanks

00AD
0600
0601
0602
0603
0604
0605
061C
06DD
070F
200E
200F
202A
202B
202C
202D
202E
2060
2061
2062
2063
2064
2066
2067
2068
2069
206A
206B
206C
206D
206E
206F
FFF9
FFFA
FFFB

And the following consume 0 cells:

180E
200B
200C
200D
FEFF

which may indeed need to be supported by wcwidth once I test a few more terminals.
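
A list of Cf code points like the ones above can be regenerated with a short script. This is only a sketch using Python's standard `unicodedata` module; the function name `cf_code_points` is hypothetical, and the output depends on the Unicode version bundled with the Python interpreter, so it may not match the 2015 lists exactly. It does not measure how many cells a terminal uses.

```python
import unicodedata


def cf_code_points(start: int = 0x0000, stop: int = 0xFFFF) -> list:
    """Collect code points in [start, stop] whose general category is Cf."""
    return [
        cp for cp in range(start, stop + 1)
        if unicodedata.category(chr(cp)) == "Cf"
    ]


if __name__ == "__main__":
    # Defaults cover the Basic Multilingual Plane only.
    for cp in cf_code_points():
        print(f"{cp:04X}")
```

Each printed code point can then be written to a terminal between two visible characters to see whether it occupies a cell.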

@stevengj
Author

(We don't trust the system-provided wcwidth either, for the same reason as you, which is why we compute the widths independently. However, the OSX 10.10.2 wcwidth agrees with our results when it returns a nonnegative value, so it mostly seems to have errors of omission—it returns -1 for many valid printable characters from recent Unicode standards. Moreover, U+00AD has been part of Unicode since 1993, so I would think that most wcwidth implementations would handle it properly.)

There is an interesting article on the soft hyphen, which apparently has had a controversial history, and is rendered in different ways depending on the font and the rendering system. I'm not sure what the right answer is here, but the Unicode standard seems to somewhat favor the viewpoint that it should be invisible although it leaves it up to the implementation. However, the article mentions that the Unicode FAQ does say "In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances", and maybe that is what is done in practice.

cc: @jiahao and @StefanKarpinski.

@stevengj
Author

Note that the Arabic characters U+0601 etc. are defined by the Unicode standard as exceptions to the usual rule that Cf characters are invisible:

  • Unlike most other format control characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order. — Unicode Standard v.6.2.0, Section 8.2 - Arabic (p.256)

In contrast, e.g. U+200E is a left-to-right mark, and in my understanding is defined as an invisible formatting character that controls the direction of the text. Some terminals may give it a nonzero width (although the MacOS Terminal with the default font gives it zero width on my machine), but that seems like a bug in the terminal (or the font); it seems like it is better to return what the Unicode standard says rather than propagating a particular buggy implementation.

@jiahao

jiahao commented Mar 12, 2015

I remember that article about the soft hyphen. Under "Modern Unicode semantics" it references UAX 14 for Unicode 7.0.0, §5.4, which says:

Unlike U+2010 hyphen, which always has a visible rendition, the character U+00AD soft hyphen (shy) is an invisible format character that merely indicates a preferred intraword line break position. If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another hyphenation mechanism, such as a dictionary lookup.

The description in the following paragraphs suggests that the rendering of a soft hyphen is accomplished not by printing the soft hyphen itself, but rather by inserting an additional, printable hyphen glyph:

The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 hyphen, U+058A armenian hyphen, U+180A mongolian nirugu, or U+1806 mongolian todo soft hyphen.

Based on this description it would seem that the character U+00AD by itself is nonprintable and should have a width of 0 or -1.

@stevengj
Author

Interestingly, the Unicode FAQ entry that the SHY article quoted seems to no longer exist. From the passage in the Unicode 7.0.0 standard that @jiahao quoted, it seems like the Unicode consortium decided to put its foot down and declare that the soft hyphen is definitely invisible, ISO 8859-1 be damned.

@jquast
Owner

jquast commented Apr 21, 2015

I really appreciate all of the research, @stevengj and @jiahao.

My decision is to use the common denominator across the most popular terminal emulators
for wcwidth. I might make a note of it in the readme that it deviates from the standard, as the
primary purpose of this project is how text is displayed by the most common (utf-8 capable)
terminal emulators.

I've made a checklist:

  • create a bin/cf-print-test.py or amend bin/wcwidth-browser.py to display 'Cf' category

Then, test the following and report:

  • xterm (any)
  • iTerm2 (osx)
  • Terminal.app (osx)
  • PuTTY (windows)
  • gnome-terminal (covers all libvte-based emulators, linux)

I'm not sure how to gauge the "popular terminal emulators", this is just from memory.

sidenote: More importantly, how to factor their weight in wcwidth for any given
differences? Perhaps there could be some way to configure how the printable width
of such discrepancies is reported, if the consumer of wcwidth
knows their target audience's emulator. Unfortunately, all such terminals
borrow the common value "xterm" or "xterm-256color" as the
TERM environment variable. One could use the response to
the "answerback sequence" (^E), which at least PuTTY replies
to, but I'm afraid that's far out of scope for wcwidth: it would
require interaction with a terminal driver.

Finally, we can make a PR and release any update.

@jiahao

jiahao commented Apr 21, 2015

@jquast thanks for your detailed consideration. As you had stated above, iTerm seems to have different needs from us at this point.

@jiahao

jiahao commented Apr 21, 2015

However, I don't think it is possible to provide consistency across terminal environments without considering also the interactions with the choice of users' fonts. Many fonts simply have wrong advance widths for some code points.

Here is a simple rendering test for the fixed-width fonts on my system. Consider

U+003C9 U+00302= \omega\hat =  ω̂

should render with the hat combining character on the omega.

U+00302 U+003C9 = \hat\omega =  ̂ω

should render with a hat to the left of omega.
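
The difference between the two orderings comes from U+0302 being a combining mark, which attaches to the base character that precedes it. As a rough illustration (standard-library Python only, with a made-up helper name `describe`), the relevant character properties can be inspected like this:

```python
import unicodedata

OMEGA = "\u03c9"       # GREEK SMALL LETTER OMEGA
CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT


def describe(ch: str) -> tuple:
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    print(describe(OMEGA))
    print(describe(CIRCUMFLEX))
    # Base followed by mark: the mark belongs to the omega.
    print(repr(OMEGA + CIRCUMFLEX))
    # Mark followed by base: the mark has no preceding base in this string.
    print(repr(CIRCUMFLEX + OMEGA))
```

This shows only what the Unicode data says; whether a given font actually draws the hat over the omega is exactly what the rendering test is probing.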

[Three screenshots of the rendering test, taken 2015-04-21]

@jquast
Owner

jquast commented Apr 21, 2015

You are correct, but terminal emulators don't typically care, they're the ones who handle the width of "printable cells" -- What is your system, is it a terminal emulator?

@jiahao

jiahao commented Apr 21, 2015

The screenshots I pasted were taken from an IPython notebook rendering test HTML using those fonts. I can see the same spacing issues if I manually change the font in OSX Terminal and generate these characters in the Julia console REPL.

@jquast
Owner

jquast commented Sep 14, 2015

Version wcwidth 0.1.5 which includes better combining character width determination by PR #11 is available on pypi.

A terminal sequence may be emitted to elicit a response from the terminal emulator reporting its cursor position.

This can be used to manually display all questionable characters across different popular font face profiles and terminal emulators, programmatically determine whether each one treats such characters as 0 width, and report the most common discrepancies, resolving each in favor of the "most correct" behavior.
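
A rough sketch of that measurement technique follows. It assumes a Unix terminal that answers the standard "device status report" query `ESC [ 6 n` with a cursor position report of the form `ESC [ row ; col R`. The function names (`parse_cpr`, `cursor_column`, `measured_width`) are invented for this example and are not part of wcwidth or any other tool mentioned in this thread.

```python
import re
import sys

CPR_QUERY = "\x1b[6n"  # DSR: ask the terminal to report the cursor position
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply: str) -> tuple:
    """Parse a cursor position report 'ESC [ row ; col R' into (row, col)."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(1)), int(match.group(2))


def cursor_column() -> int:
    """Query the terminal for the current cursor column (Unix TTY only)."""
    import termios
    import tty

    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # read the reply unbuffered and without echo
        sys.stdout.write(CPR_QUERY)
        sys.stdout.flush()
        reply = ""
        while not reply.endswith("R"):
            reply += sys.stdin.read(1)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cpr(reply)[1]


def measured_width(text: str) -> int:
    """Columns the cursor advanced after printing text from column 1."""
    sys.stdout.write("\r")
    sys.stdout.flush()
    before = cursor_column()
    sys.stdout.write(text)
    sys.stdout.flush()
    after = cursor_column()
    sys.stdout.write("\r\x1b[K")  # return to column 1 and clear the test output
    sys.stdout.flush()
    return after - before


if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    print("U+00AD advanced the cursor by", measured_width("\u00ad"), "column(s)")
```

Running the same script in several emulators and font profiles gives one measured width per environment, which can then be compared against the library's tables.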

@jquast jquast added the bug label Jun 1, 2020
jquast added a commit that referenced this issue Oct 30, 2023
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow !

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.  
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
@jquast
Owner

jquast commented Oct 30, 2023

This is closed by #91

  • About U+00AD in particular, it is part of the Cf category, and the entire 'Cf' category is now classified as zero-width, along with 'Mc', 'Zl', 'Zp', and part of the 'Sk' category. I have written this specification, which describes precisely how the width of characters is determined: https://github.com/jquast/wcwidth/blob/master/docs/specs.rst#width-of-0 I hope it is helpful.

  • This issue also talked about the need to best match the behavior of popular terminals. I have also published an automatic testing tool for wide, zero, combining, and emoji zwj sequences. Though this only works with python's wcwidth, the technique would be very easy to copy to, or use to aid, other languages or wcwidth implementations: https://pypi.org/project/ucs-detect/

  • And finally, "BIDI" text was mentioned. I suggest seeing the related resource https://gist.github.com/XVilka/a0e49e1c65370ba11c17 about the state of BIDI, which has had some traction in the last few years. In any case, the 'ucs-tool' appears to verify that left-to-right text with wcwidth is ok. The LTR marker is 0-width.

@jquast jquast closed this as completed Oct 30, 2023
@avih

avih commented Mar 13, 2024

For reference, in glibc wcwidth(0xad) appears to be 1, judging by this discussion, which concluded that it should be 1: https://sourceware.org/bugzilla/show_bug.cgi?id=22073

That discussion took place in 2017 - after the main discussion in this issue, but before the last #8 (comment) here.

Also, in musl-libc, 0xad is also of wcwidth 1.

@jquast jquast reopened this Mar 13, 2024
@jquast
Owner

jquast commented Mar 13, 2024

It's a bit ambiguous isn't it? From https://codepoints.net/U+00AD,

is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they are fall on the line end but remain invisible within the line.

I will add a test to ucs-detect and whichever measured width (0 or 1) that is used among the most popular and compliant terminals will be used in this library.

@avih

avih commented Mar 13, 2024

For what it's worth, the musl-libc maintainer, @richfelker said on IRC that he thinks it should be 1 because historically it was 1 (in most/all implementations?), and, quoting, "(dalias) unless there's widespread agreement between terminals and wcwidth implementations, all you get by changing it is screen corruption".

Additionally, it was not discussed on the musl mailing lists, possibly because that was acceptable (or no one noticed or cared?).

Additionally, he noted that if anything, it should have probably been -1 and not 0, because if applied, then it affects formatting, not unlike carriage-return or newline or form-feed etc.

And finally, he mentions that "it's widely unused anyway", which is probably true, hence probably not too important overall, though agreement between wcwidth implementations would still be nice.

@jquast
Owner

jquast commented Mar 13, 2024 via email

@avih

avih commented Mar 13, 2024

if the most popular terminal emulators measure it as width of 1 then I’d like to match

Right.

I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides? (not including windows terminal, which brings its own implementation, because on windows there's no system wcwidth).

And so ultimately, I would think the goal should be agreement between wcwidth implementations, rather than between this implementation and the behavior of popular terminal emulators?

@stevengj
Author

Ultimately, the utf8proc library also decided to report a width of 1 for U+00AD, in order to agree with other wcwidth implementations and with typical terminal programs, which display a soft hyphen as a visible - glyph.

@stevengj stevengj reopened this Mar 13, 2024
@avih

avih commented Mar 13, 2024

I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides?

Well, that was not a good argument, and I would agree that if this was the only or main wcwidth implementation, then it should try to match the common terminal emulators behavior.

But because this is one of several wcwidth implementations, its goal should be to agree with other wcwidth implementations rather than the terminals.

That being said, it would still be nice to know how terminals handle it.

In which case, the test should be dual:

  • In the middle of a line - where semantically it should be 0.
  • Towards the end of a line, where Unicode suggests that if a word doesn't fit, then it should have width 1 with visible hyphen[-like] glyph, followed by newline.

I would guess that most terminals don't handle it dually like the Unicode semantics suggests (and would imply a -1 wcwidth value), hence they probably treat it as always 1 or always 0, though that's a guess.

@avih

avih commented Mar 16, 2024

At which case, the test should be dual...

So, I tested it in the following terminals on Alpine linux 3.19.1, and all the tested terminal emulators treat it either as hard 0 or hard 1. I.e. no terminal handles it dually as 0 at the middle of the line and hyphen+wordbreak in a word which spills over the end of the line.

Specifically, I tested using this script, and observed the result on-screen (not automated). the SHY byte is always at this word xxx<SHY>yyy:

EDITED: THIS SCRIPT IS BROKEN AND THE RESULTS ARE INVALID. See fixed script at the next post.

test-shy.sh (broken)
#!/bin/sh

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }

nth() { shift $1; printf %s\\n "$1"; }

cols() {
      if [ "${COLUMNS-}" ]; then echo $COLUMNS
    elif has stty;       then nth 2 $(stty size)
    elif has ttysize;    then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)
printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx\255yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx\255yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx\255yyy bbb\n\n"

All the terminals were invoked with UTF-8 locale, e.g.:

LC_ALL=en_US.UTF-8 xterm

Results:

xterm 388, VTE (tested {gnome,xfce4,lx}-terminal), konsole 23.08.4, and st 0.9: always display it as U+FFFD REPLACEMENT CHARACTER, as if wcwidth(0xad) == 1:

[Screenshot: shy-xterm]

urxvt: similar to xterm etc. above, but always displays it as a hyphen, as if wcwidth(0xad) == 1.

alacritty 0.12.3 and kitty 0.31.0: seem to ignore it at the input, as if wcwidth(0xad) == 0:

[Screenshot: shy-alacritty]

So while 1 is common, I don't think it's black and white.

So I would think the goal should be to match other wcwidth implementations, where the value appears to be 1 at least in glibc, musl, and utf8proc.

@avih

avih commented Mar 18, 2024

Actually, the test script above is wrong. It printed the byte 0xad (which is an invalid UTF-8 sequence) rather than the UTF-8 sequence for U+00AD, which is 0xc2 0xad.
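
The encoding point can be checked quickly in Python. This is only an illustrative sketch, and `is_valid_utf8` is a helper name invented here: a lone 0xAD byte is a UTF-8 continuation byte with no lead byte, while U+00AD itself encodes to the two bytes 0xC2 0xAD, which are `\302\255` in the octal notation that `printf` accepts.

```python
SOFT_HYPHEN = "\u00ad"

# U+00AD encodes to two bytes in UTF-8: 0xC2 0xAD (octal \302 \255).
encoded = SOFT_HYPHEN.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(encoded)
    print(is_valid_utf8(b"\xad"))      # the byte the broken script printed
    print(is_valid_utf8(b"\xc2\xad"))  # the bytes a fixed script should print
```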

This is the revised script:

fixed test-shy.sh
#!/bin/sh

sf="\302\255"  # printf fmt of UTF-8 of U+00AD SOFT-HYPHEN

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }

nth() { shift $1; printf %s\\n "$1"; }

cols() {
      if [ "$COLUMNS" ]; then echo $COLUMNS
    elif has stty;       then nth 2 $(stty size)
    elif has ttysize;    then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)
printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx${sf}yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx${sf}yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx${sf}yyy bbb\n\n"

And these are the results at the various terminals (kitty doesn't have "kitty" at the title, and xfce4-terminal and gnome-terminal have the same result as lxterminal - as all are VTE-based):
[Screenshot: soft-hyphen-terminals]

Like before, this is on Alpine linux 3.19.1 with the terminals installed from the distro packages repository, and all terminals were invoked after exporting LC_ALL=en_US.UTF-8.

Results:

  • xterm, alacritty, st, and rxvt-unicode always display it as hard-hyphen, as if wcwidth(0xad) == 1.
  • VTE terminals (xfce4-terminal, gnome-terminal, lxterminal), and konsole always display it as hard space, as if wcwidth(0xad) == 1.
  • Kitty seems to ignore it at the input, as if wcwidth(0xad) == 0.

@avih

avih commented Oct 2, 2024

Here's a summary of the U+00AD SOFT HYPHEN behavior:

  • The original implementation by Markus Kuhn which this and many others are based on - http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c has it explicitly as width of 1. See the original comment from this file below.
  • This repository also had it originally as 1.
  • Following this issue wrong width for U+00AD #8, this repository changed it to 0 at commit 04d6d90 .
  • However, as demonstrated above, with the exception of kitty, all tested terminals agree that it's always 1, as well as many other wcwidth implementations, including git's.

Therefore I think it should be added/restored as an overriding exception - return 1 for 0x00ad, to reflect terminals behavior and align with other wcwidth implementations.

original comment by Markus Kuhn from the linked file:
/* The following two functions define the column width of an ISO 10646
 * character as follows:
 *
 *    - The null character (U+0000) has a column width of 0.
 *
 *    - Other C0/C1 control characters and DEL will lead to a return
 *      value of -1.
 *
 *    - Non-spacing and enclosing combining characters (general
 *      category code Mn or Me in the Unicode database) have a
 *      column width of 0.
 *
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
 *
 *    - Other format characters (general category code Cf in the Unicode
 *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
 *
 *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
 *      have a column width of 0.
 *
 *    - Spacing characters in the East Asian Wide (W) or East Asian
 *      Full-width (F) category as defined in Unicode Technical
 *      Report #11 have a column width of 2.
 *
 *    - All remaining characters (including all printable
 *      ISO 8859-1 and WGL4 characters, Unicode control characters,
 *      etc.) have a column width of 1.
...
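
To make the quoted rules concrete, here is an approximate Python rendering of them. It is a sketch only: it relies on the Unicode data bundled with Python's `unicodedata` module rather than the static tables in the original C file, so results for individual code points may differ from that file, and the function name `kuhn_style_width` is hypothetical.

```python
import unicodedata


def kuhn_style_width(ch: str) -> int:
    """Approximate the column-width rules quoted above for one character."""
    cp = ord(ch)
    if cp == 0:
        return 0   # null character
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1  # other C0/C1 control characters and DEL
    if cp == 0x00AD:
        return 1   # SOFT HYPHEN is an explicit exception to the Cf rule
    category = unicodedata.category(ch)
    if category in ("Mn", "Me", "Cf") or cp == 0x200B:
        return 0   # combining marks, format characters, ZERO WIDTH SPACE
    if 0x1160 <= cp <= 0x11FF:
        return 0   # Hangul Jamo medial vowels and final consonants
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2   # East Asian Wide or Fullwidth
    return 1       # all remaining characters


if __name__ == "__main__":
    for sample in ("a", "\u00ad", "\u200e", "\u0302", "\u4e2d"):
        print(f"U+{ord(sample):04X}", kuhn_style_width(sample))
```

The soft hyphen check is placed before the general Cf rule, which is the "overriding exception" ordering proposed above.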

@stevengj
Author

stevengj commented Oct 3, 2024

utf8proc now returns 1 as well (JuliaStrings/utf8proc#135).
