charwidth=1 for soft hyphen and unassigned codepoints #135

stevengj · 2018-05-03T20:22:19Z

Changes U+00AD (soft hyphen) to have charwidth=1. As discussed in the Unicode FAQ and in this article, terminal contexts typically display this as a visible hyphen for historical reasons. Since charwidth is mainly intended for such contexts, it makes sense to report it as width=1.

The other change in this PR is to report charwidth=1 (previously 0) for unassigned and PUA codepoints. We have to report some width for these, and 1 seems like a better guess than 0 (since 0-width characters are much rarer). Also, a terminal is likely to display these as the replacement character U+FFFD, which has width 1.
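As a rough illustration of the two changes described above (not utf8proc's actual C implementation), the sketch below layers them over an arbitrary existing width rule. The name `width_with_pr_overrides` and the `fallback` parameter are hypothetical, and the category lookups use Python's bundled `unicodedata` tables, which may be from a different Unicode version than utf8proc's.

```python
import unicodedata
from typing import Callable

SOFT_HYPHEN = 0x00AD


def width_with_pr_overrides(cp: int, fallback: Callable[[int], int]) -> int:
    """Hypothetical helper: apply the two overrides from this PR, else defer.

    `fallback` stands in for whatever width rule was used before.
    """
    # Terminals typically show the soft hyphen as a visible hyphen.
    if cp == SOFT_HYPHEN:
        return 1
    # Cn = unassigned, Co = private use; guess 1 rather than 0.
    if unicodedata.category(chr(cp)) in ("Cn", "Co"):
        return 1
    return fallback(cp)


# Usage: pretend the old rule reported 0 for everything shown here.
for cp in (0x00AD, 0xE000, 0x40000, 0x200B):
    print(f"U+{cp:04X} -> {width_with_pr_overrides(cp, lambda _: 0)}")
```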

@ararslan, this should reduce the number of disagreements with musl (discussed in #127). Are there any remaining conflicts where we give width 0 and musl gives width > 0? Also vice versa, but we know that there will be a few of those like U+0601. (There will still be conflicts where we disagree about whether the width is 1 or 2, but we already know about this.)

ararslan · 2018-05-03T22:03:51Z

Yeah, there are still 2,945 cases where musl wcwidth > 0 and utf8proc charwidth == 0. The first few examples:

│ Row │ char  │ musl │ utf8proc │
├─────┼───────┼──────┼──────────┤
│ 1   │ 0x605 │ 1    │ 0        │
│ 2   │ 0x61c │ 1    │ 0        │
│ 3   │ 0x8d4 │ 1    │ 0        │
│ 4   │ 0x8d5 │ 1    │ 0        │
│ 5   │ 0x8d6 │ 1    │ 0        │
│ 6   │ 0x8d7 │ 1    │ 0        │

There are apparently now 655,503 cases where musl wcwidth == 0 and utf8proc charwidth > 0. The first few:

│ Row │ char    │ musl │ utf8proc │
├─────┼─────────┼──────┼──────────┤
│ 1   │ 0x601   │ 0    │ 2        │
│ 2   │ 0x602   │ 0    │ 2        │
│ 3   │ 0x603   │ 0    │ 2        │
│ 4   │ 0x6dd   │ 0    │ 2        │
│ 5   │ 0x40000 │ 0    │ 1        │
│ 6   │ 0x40001 │ 0    │ 1        │
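For readers who want to reproduce this kind of tally, here is a minimal sketch of the comparison. It is not the script used above: `width_disagreements` is a hypothetical helper, and the two lookup tables are toy stand-ins filled with a few values copied from the tables in this comment rather than real calls into musl or utf8proc.

```python
from typing import Callable, Iterable, List, Tuple


def width_disagreements(
    codepoints: Iterable[int],
    width_a: Callable[[int], int],
    width_b: Callable[[int], int],
) -> Tuple[List[int], List[int]]:
    """Return (a > 0 and b == 0, a == 0 and b > 0) codepoint lists."""
    a_only: List[int] = []
    b_only: List[int] = []
    for cp in codepoints:
        a, b = width_a(cp), width_b(cp)
        if a > 0 and b == 0:
            a_only.append(cp)
        elif a == 0 and b > 0:
            b_only.append(cp)
    return a_only, b_only


# Toy stand-ins using a few rows from the tables above.
musl = {0x605: 1, 0x61C: 1, 0x601: 0, 0x40000: 0}
utf8proc = {0x605: 0, 0x61C: 0, 0x601: 2, 0x40000: 1}

musl_only, utf8proc_only = width_disagreements(
    sorted(musl), musl.__getitem__, utf8proc.__getitem__
)
print([hex(cp) for cp in musl_only])
print([hex(cp) for cp in utf8proc_only])
```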

stevengj · 2018-05-04T16:36:20Z

According to Unicode 10, section 9.2 (Arabic)

U+0605 arabic number mark above is a specialized variant of number sign. It is used in Arabic text with Coptic numbers, such as in early astronomical tables. Unlike the other Arabic number signs, it extends across the top of the sequence of digits, and is used with Coptic digits, rather than with Arabic digits. (See also the discussion of supralineation and the numerical use of letters in Section 7.3, Coptic.)

and in section 22.3 (Numerals),

Ordinary Coptic numbers are often distinguished from Coptic letters by marking them with a line above. (See Section 7.3, Coptic.) A visually similar convention is also seen for Coptic epact numbers, where an entire numeric sequence may be marked with a wavy line above. This mark is represented by U+0605 arabic number mark above. As when used with Arabic digits, arabic number mark above precedes the sequence of Coptic epact numbers in the underlying representation, and is rendered across the top of the entire sequence for display.

So it seems like a zero width for U+605 (category Cf) is correct. It is rendered above the subsequent digits, rather than taking up space on its own. (It is not a combining character since it affects a sequence of subsequent digits rather than preceding digits.)

Unicode 10.0 section 23.2 (Layout Controls) says

U+200E left-to-right mark, U+200F right-to-left mark, and U+061C arabic letter mark have the semantics of an invisible character of zero width, except that these characters have strong directionality.

So I think assigning zero width to the (category Cf) character U+061C is again correct.

Codepoint U+08d5 is a combining character in category Mn (Mark, nonspacing). (It's a kind of ligature, as I understand it.) Normally, category-Mn characters take no space on their own (to display them on their own, Unicode section 2.11 recommends placing them after U+00A0 non-breaking-space). So, a zero width seems appropriate again.

The case of U+0601 was discussed in #127, and a nonzero width seemed correct, but now I'm not sure. Reading section 9.2 of the Unicode standard, it says

U+0601 arabic sign sanah indicates a year (that is, as part of a date). This sign is also rendered below the digits of the number it precedes.

and the standard shows the example:

Similarly for the other category-Cf characters U+602, U+603, and U+604. So it seems that, like U+605, they should all have zero width. The same goes for U+06dd (which "encloses" the subsequent digit sequence).

In general, I'm not sure there's much practical interest in terminal rendering of these cursive Arabic scripts, which have complicated context-dependent renderings. But it doesn't seem reasonable to give a different width for U+601 and U+605, since both are formatting characters that change the rendering of subsequent digits.

The justification for the current rule used by utf8proc was by @jiahao in http://nbviewer.jupyter.org/gist/jiahao/07e8b08bf6d8671e9734 (for #27) and comes from the Unicode standard, in what is now section 9.2 (Arabic):

Unlike most other format characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order.

(U+605 did not exist at the time of #27.) So, these formatting characters take no space when they are used in the intended context, preceding digits in the appropriate script, but take up space in other contexts? Clearly, we should adopt the same rule for all of them, and my inclination is to assign them zero width — that will be the right thing when they are used "correctly".
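To make the "same rule for all of them" idea concrete, here is a small sketch that treats every category-Cf or category-Mn codepoint as zero width and prints the categories of the characters discussed in this comment. `is_zero_width_by_category` is a hypothetical name, the categories come from Python's bundled `unicodedata` tables (possibly a newer Unicode version than 10), and note that U+00AD is also category Cf, so the soft-hyphen change in this PR would still need its own exception on top of a rule like this.

```python
import unicodedata

# Assumption for this sketch: format controls (Cf) and nonspacing
# marks (Mn) occupy no column of their own.
ZERO_WIDTH_CATEGORIES = {"Cf", "Mn"}


def is_zero_width_by_category(cp: int) -> bool:
    """Hypothetical uniform rule: decide purely by general category."""
    return unicodedata.category(chr(cp)) in ZERO_WIDTH_CATEGORIES


# The codepoints discussed in this comment.
DISCUSSED = (0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x061C, 0x06DD, 0x08D5)

for cp in DISCUSSED:
    ch = chr(cp)
    print(
        f"U+{cp:04X} {unicodedata.category(ch)} "
        f"{unicodedata.name(ch, '<unnamed>')} "
        f"zero_width={is_zero_width_by_category(cp)}"
    )
```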

(Again, I'm not sure how much it matters — terminal support for Arabic rendering seems to be very limited anyway.)

U+40000+ is an unassigned plane. As mentioned above, we intentionally give unassigned and PUA chars width=1 in this PR, for the reasons given above.

stevengj · 2018-07-23T19:35:23Z

Rebased.

stevengj mentioned this pull request May 29, 2018

update to unicode 10 #132

Merged

stevengj closed this Jul 23, 2018

stevengj reopened this Jul 23, 2018

stevengj added 6 commits July 23, 2018 15:16

use width=1 for soft hyphen and for unassigned/PUA codepoints

fc8ef63

don't count unassigned codepoints when comparing with system wcwidth

70fa75c

more tests

f0639a4

indentation fixes

24d371d

NEWS for 135

d6b1846

remove special-casing for arabic control characters affecting a span …

c5d09ba

…of numbers, which are sometimes zero-width and sometimes not

stevengj force-pushed the newwidths branch from 2097dec to c5d09ba Compare July 23, 2018 19:35

regenerate

630b78b

stevengj merged commit 02f4e18 into master Jul 24, 2018

stevengj deleted the newwidths branch July 24, 2018 14:45

stevengj mentioned this pull request Oct 3, 2024

wrong width for U+00AD jquast/wcwidth#8

Open
