charwidth=1 for soft hyphen and unassigned codepoints #135

Merged
merged 7 commits into from
Jul 24, 2018


stevengj
Member

@stevengj stevengj commented May 3, 2018

Changes U+00AD (soft hyphen) to have charwidth=1. As discussed in the Unicode FAQ and in this article, terminals typically display it as a visible hyphen for historical reasons. Since charwidth is mainly intended for terminal contexts, it makes sense to report width=1.

The other change in this PR is to report charwidth=1 (previously 0) for unassigned and PUA codepoints. We have to report some width for these, and 1 seems like a better guess than 0 (since 0-width characters are much rarer). Also, a terminal is likely to display these as the replacement character U+FFFD, which has width 1.
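The two rules described above can be sketched in a few lines of Python. This is only an illustration built on the standard `unicodedata` module, not utf8proc's actual width table: the function name `sketch_charwidth` is hypothetical, and the zero-width treatment of categories Mn, Me, Cf, and Cc, as well as the East Asian wide rule, are assumptions of this sketch.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def sketch_charwidth(ch: str) -> int:
    """Illustrative column-width guess for one character (hypothetical helper)."""
    if ch == SOFT_HYPHEN:
        # Terminals typically show the soft hyphen as a visible hyphen.
        return 1
    cat = unicodedata.category(ch)
    if cat in ("Cn", "Co"):
        # Unassigned or private-use: guess 1, the width a U+FFFD stand-in would take.
        return 1
    if cat in ("Mn", "Me", "Cf", "Cc"):
        # Assumption: nonspacing marks, format and control characters take no column.
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        # Assumption: East Asian wide and fullwidth characters take two columns.
        return 2
    return 1
```

With this sketch, `sketch_charwidth("\u00ad")` and `sketch_charwidth(chr(0x40000))` both give 1, while a combining accent such as U+0301 gives 0.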

@ararslan, this should reduce the number of disagreements with musl (discussed in #127). Are there any remaining conflicts where we give width 0 and musl gives width > 0? Also vice versa, but we know that there will be a few of those like U+0601. (There will still be conflicts where we disagree about whether the width is 1 or 2, but we already know about this.)

@ararslan
Member

ararslan commented May 3, 2018

Yeah there are still 2,945 cases where musl wcwidth > 0 and utf8proc charwidth == 0. The first few examples:

│ Row │ char  │ musl │ utf8proc │
├─────┼───────┼──────┼──────────┤
│ 1   │ 0x605 │ 1    │ 0        │
│ 2   │ 0x61c │ 1    │ 0        │
│ 3   │ 0x8d4 │ 1    │ 0        │
│ 4   │ 0x8d5 │ 1    │ 0        │
│ 5   │ 0x8d6 │ 1    │ 0        │
│ 6   │ 0x8d7 │ 1    │ 0        │

There are apparently now 655,503 cases where musl wcwidth == 0 and utf8proc charwidth > 0. The first few:

│ Row │ char    │ musl │ utf8proc │
├─────┼─────────┼──────┼──────────┤
│ 1   │ 0x601   │ 0    │ 2        │
│ 2   │ 0x602   │ 0    │ 2        │
│ 3   │ 0x603   │ 0    │ 2        │
│ 4   │ 0x6dd   │ 0    │ 2        │
│ 5   │ 0x40000 │ 0    │ 1        │
│ 6   │ 0x40001 │ 0    │ 1        │
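A comparison of this kind can be sketched as follows. The helper `width_conflicts` is hypothetical, and the two lookup tables are toy stand-ins seeded with a few rows quoted from the tables above; they are not bindings to musl or utf8proc.

```python
from typing import Callable, Iterable, List, Tuple

WidthFn = Callable[[int], int]

def width_conflicts(width_a: WidthFn, width_b: WidthFn,
                    codepoints: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return rows (codepoint, a, b) where width_a > 0 but width_b == 0."""
    rows = []
    for cp in codepoints:
        a, b = width_a(cp), width_b(cp)
        if a > 0 and b == 0:
            rows.append((cp, a, b))
    return rows

# Toy data copied from the tables above (assumption: stand-ins, not the real libraries).
toy_musl = {0x605: 1, 0x61C: 1, 0x601: 0, 0x40000: 0}
toy_utf8proc = {0x605: 0, 0x61C: 0, 0x601: 2, 0x40000: 1}
sample = sorted(toy_musl)

# Direction 1: musl > 0 while utf8proc == 0.
musl_only = width_conflicts(toy_musl.__getitem__, toy_utf8proc.__getitem__, sample)
# Direction 2: utf8proc > 0 while musl == 0.
utf8proc_only = width_conflicts(toy_utf8proc.__getitem__, toy_musl.__getitem__, sample)

for cp, a, b in musl_only + utf8proc_only:
    print(f"{cp:#x}: {a} vs {b}")
```

A real run would replace the two dictionaries with calls into the libraries being compared and iterate over the full code point range.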

@stevengj
Member Author

stevengj commented May 4, 2018

According to Unicode 10.0, section 9.2 (Arabic):

U+0605 arabic number mark above is a specialized variant of number sign. It is used in Arabic text with Coptic numbers, such as in early astronomical tables. Unlike the other Arabic number signs, it extends across the top of the sequence of digits, and is used with Coptic digits, rather than with Arabic digits. (See also the discussion of supralineation and the numerical use of letters in Section 7.3, Coptic.)

and in section 22.3 (Numerals),

Ordinary Coptic numbers are often distinguished from Coptic letters by marking them with a line above. (See Section 7.3, Coptic.) A visually similar convention is also seen for Coptic epact numbers, where an entire numeric sequence may be marked with a wavy line above. This mark is represented by U+0605 arabic number mark above. As when used with Arabic digits, arabic number mark above precedes the sequence of Coptic epact numbers in the underlying representation, and is rendered across the top of the entire sequence for display.

So a zero width for U+0605 (category Cf) seems correct: it is rendered above the subsequent digits rather than taking up space of its own. (It is not a combining character, since it affects a sequence of subsequent digits rather than the preceding character.)

Unicode 10.0 section 23.2 (Layout Controls) says

U+200E left-to-right mark, U+200F right-to-left mark, and U+061C arabic letter mark have the semantics of an invisible character of zero width, except that these characters have strong directionality.

So I think assigning zero width to the (category Cf) character U+061C is again correct.

Codepoint U+08d5 is a combining character in category Mn (Mark, nonspacing). (It's a kind of ligature, as I understand it.) Normally, category-Mn characters take no space on their own (to display them on their own, Unicode section 2.11 recommends placing them after U+00A0 non-breaking-space). So, a zero width seems appropriate again.
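The general categories cited above can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper `describe` is hypothetical, and the results depend on the Unicode version bundled with the interpreter (U+08D5 needs Unicode 9.0 or later, which I believe means Python 3.6 or newer).

```python
import unicodedata

def describe(cp: int) -> tuple:
    """Return (general category, canonical combining class, name) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.name(ch, "<unnamed>"),
    )

# The three code points discussed so far in this comment.
for cp in (0x0605, 0x061C, 0x08D5):
    category, combining_class, name = describe(cp)
    print(f"U+{cp:04X} {category} ccc={combining_class} {name}")
```

The first two should report category Cf and the last one Mn, matching the reasoning for zero width given above.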

The case of U+0601 was discussed in #127, and a nonzero width seemed correct, but now I'm not sure. Reading section 9.2 of the Unicode standard, it says

U+0601 arabic sign sanah indicates a year (that is, as part of a date). This sign is also rendered below the digits of the number it precedes.

and the standard shows the example:
[image: example from the standard]
The same applies to the other category-Cf characters U+0602, U+0603, and U+0604. So it seems that, like U+0605, they should all have zero width. The same goes for U+06DD (which "encloses" the subsequent digit sequence).

In general, I'm not sure there's much practical interest in terminal rendering of these cursive Arabic scripts, which have complicated context-dependent renderings. But it doesn't seem reasonable to give a different width for U+601 and U+605, since both are formatting characters that change the rendering of subsequent digits.

The justification for the current rule used by utf8proc was by @jiahao in http://nbviewer.jupyter.org/gist/jiahao/07e8b08bf6d8671e9734 (for #27) and comes from the Unicode standard, in what is now section 9.2 (Arabic):

Unlike most other format characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order.

(U+605 did not exist at the time of #27.) So, these formatting characters take no space when they are used in the intended context, preceding digits in the appropriate script, but take up space in other contexts? Clearly, we should adopt the same rule for all of them, and my inclination is to assign them zero width — that will be the right thing when they are used "correctly".
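One way to express "the same rule for all of them" is a single override applied before any other width lookup. The sketch below is hypothetical: the set contains only the code points named in this discussion, and `uniform_width` and its `fallback` parameter are illustrative names, not part of utf8proc.

```python
from typing import Callable

# Assumption: only the code points named in this thread; a real table may list more.
ARABIC_PREFIX_FORMAT = frozenset({0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x06DD})

def uniform_width(cp: int, fallback: Callable[[int], int]) -> int:
    """Give every listed format character width 0; defer to fallback otherwise."""
    if cp in ARABIC_PREFIX_FORMAT:
        return 0
    return fallback(cp)

# Usage with a trivial fallback that reports width 1 for everything else.
widths = {cp: uniform_width(cp, lambda _cp: 1) for cp in (0x0601, 0x0605, 0x06DD, 0x0041)}
```

Keeping the override in one set makes it hard for U+0601 and U+0605 to drift apart again.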

(Again, I'm not sure how much it matters — terminal support for Arabic rendering seems to be very limited anyway.)

U+40000 and above lie in an unassigned plane. As explained in the PR description, we intentionally give unassigned and PUA characters width=1 in this PR.

@stevengj stevengj mentioned this pull request May 29, 2018
@stevengj stevengj closed this Jul 23, 2018
@stevengj stevengj reopened this Jul 23, 2018
@stevengj
Member Author

Rebased.

@stevengj stevengj merged commit 02f4e18 into master Jul 24, 2018
@stevengj stevengj deleted the newwidths branch July 24, 2018 14:45