charwidth=1 for soft hyphen and unassigned codepoints #135
Yeah there are still 2,945 cases where musl wcwidth > 0 and utf8proc charwidth == 0. The first few examples:
There are apparently now 655,503 cases where musl wcwidth == 0 and utf8proc charwidth > 0. The first few:
According to Unicode 10, section 9.2 (Arabic)
and in section 22.3 (Numerals),
So it seems like a zero width for U+605 (category Cf) is correct. It is rendered above the subsequent digits, rather than taking up space on its own. (It is not a combining character since it affects a sequence of subsequent digits rather than preceding digits.) Unicode 10.0 section 23.2 (Layout Controls) says
So I think assigning zero width to the (category Cf) character U+061C is again correct.

Codepoint U+08D5 is a combining character in category Mn (Mark, nonspacing). (It's a kind of ligature, as I understand it.) Normally, category-Mn characters take no space on their own (to display them on their own, Unicode section 2.11 recommends placing them after U+00A0 non-breaking space). So, a zero width seems appropriate again.

The case of U+0601 was discussed in #127, and a nonzero width seemed correct, but now I'm not sure. Reading section 9.2 of the Unicode standard, it says
and the standard shows the example:

In general, I'm not sure there's much practical interest in terminal rendering of these cursive Arabic scripts, which have complicated context-dependent renderings. But it doesn't seem reasonable to give a different width for U+601 and U+605, since both are formatting characters that change the rendering of subsequent digits.

The justification for the current rule used by utf8proc was by @jiahao in http://nbviewer.jupyter.org/gist/jiahao/07e8b08bf6d8671e9734 (for #27) and comes from the Unicode standard, in what is now section 9.2 (Arabic):
(U+605 did not exist at the time of #27.) So, these formatting characters take no space when they are used in the intended context, preceding digits in the appropriate script, but take up space in other contexts? Clearly, we should adopt the same rule for all of them, and my inclination is to assign them zero width, since that will be the right thing when they are used "correctly". (Again, I'm not sure how much it matters; terminal support for Arabic rendering seems to be very limited anyway.)

U+40000+ is an unassigned plane. As explained in the PR description, we intentionally give unassigned and PUA characters width=1 in this PR.
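To make the category argument above easier to check, here is a minimal Python sketch that looks up the general category of the codepoints under discussion. It uses Python's standard `unicodedata` module rather than utf8proc, and `describe` is a hypothetical helper name introduced only for this illustration. The results depend on the Unicode version bundled with the Python build, so newer codepoints may show up as unassigned (`Cn`) on older interpreters.

```python
import unicodedata


def describe(codepoint):
    """Return (general category, character name) for one codepoint.

    Hypothetical helper for illustration; this is not part of utf8proc.
    """
    ch = chr(codepoint)
    # Unassigned codepoints have no name, so supply a placeholder.
    return unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")


# Codepoints mentioned in the discussion above.
for cp in (0x0601, 0x0605, 0x061C, 0x08D5, 0x40000):
    category, name = describe(cp)
    print(f"U+{cp:04X}  {category}  {name}")
```

Format controls should be reported as `Cf`, nonspacing marks as `Mn`, and codepoints in an unassigned plane as `Cn`.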
…of numbers, which are sometimes zero-width and sometimes not
Rebased.
Changes U+00AD (soft hyphen) to have charwidth=1. As discussed in the Unicode FAQ and in this article, terminal contexts typically display this as a visible hyphen `-` for historical reasons, and since `charwidth` is mainly for such contexts, it makes sense to report it as width=1.

The other change in this PR is to report charwidth=1 (previously 0) for unassigned and PUA codepoints. We have to report some width for these, and 1 seems like a better guess than 0 (since 0-width characters are much rarer). Also, a terminal is likely to display these as the replacement character U+FFFD, which has width 1.
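The two changes described above can be sketched roughly as follows. This is an illustration in Python using the standard `unicodedata` module, not the actual utf8proc code (which is C and works from generated tables). The function name `guess_width` and the zero-width category set are assumptions made for this example, and wide (width-2) characters are deliberately ignored.

```python
import unicodedata

SOFT_HYPHEN = 0x00AD


def guess_width(codepoint):
    """Illustrative width guess covering the cases this PR changes.

    Hypothetical helper; it does not reproduce utf8proc's full table.
    """
    if codepoint == SOFT_HYPHEN:
        # Terminals typically draw the soft hyphen as a visible hyphen.
        return 1
    category = unicodedata.category(chr(codepoint))
    if category in ("Cn", "Co"):
        # Unassigned or private use: some width must be reported, and a
        # terminal is likely to draw U+FFFD, which has width 1.
        return 1
    if category in ("Mn", "Me", "Cf"):
        # Assumption: marks and format controls take no cell of their own.
        return 0
    # Simplification: everything else is width 1 (no East Asian wide handling).
    return 1


print(guess_width(0x00AD), guess_width(0xE000), guess_width(0x40000))
```

The soft-hyphen check comes first because U+00AD is itself in category Cf and would otherwise fall into the zero-width branch.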
@ararslan, this should reduce the number of disagreements with musl (discussed in #127). Are there any remaining conflicts where we give width 0 and musl gives width > 0? Also vice versa, but we know that there will be a few of those like U+0601. (There will still be conflicts where we disagree about whether the width is 1 or 2, but we already know about this.)
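A comparison like the one requested here could be scripted along these lines. This is only a sketch: `find_disagreements` is a hypothetical helper, and the two toy width functions return made-up values for illustration. They are stand-ins for real bindings to utf8proc's `charwidth` and musl's `wcwidth`, which are not shown.

```python
def find_disagreements(ours, theirs, codepoints):
    """Split codepoints into the two zero-vs-nonzero disagreement classes.

    Returns (ours_zero, theirs_zero): codepoints where `ours` reports 0 but
    `theirs` reports > 0, and codepoints where `ours` reports > 0 but
    `theirs` reports 0. Disagreements of 1 vs 2 are deliberately ignored.
    """
    ours_zero, theirs_zero = [], []
    for cp in codepoints:
        a, b = ours(cp), theirs(cp)
        if a == 0 and b > 0:
            ours_zero.append(cp)
        elif a > 0 and b == 0:
            theirs_zero.append(cp)
    return ours_zero, theirs_zero


# Toy stand-ins with made-up values; replace with real bindings to compare.
def toy_ours(cp):
    return 0 if cp == 0x0605 else 1


def toy_theirs(cp):
    return 0 if cp == 0x40000 else 1


zero_here, zero_there = find_disagreements(
    toy_ours, toy_theirs, [0x0041, 0x0605, 0x40000]
)
print([f"U+{cp:04X}" for cp in zero_here])
print([f"U+{cp:04X}" for cp in zero_there])
```

Keeping the two classes separate matches how the question is framed: the "we give 0, musl gives > 0" cases are the ones this PR aims to eliminate, while a few cases in the other direction are expected.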