Case conversion test fails on Alpine Linux #127

Closed
ararslan opened this issue Apr 18, 2018 · 34 comments

Comments

@ararslan
Member

Output from make check:

test/case
MISMATCH df != towupper(df) == 1e9e
line 0: utf8proc case conversion FAILED 1 tests.
make: *** [Makefile:148: check] Error 1

There are also an extraordinary number of mismatches with the system wcwidth, though the width tests still pass.

@stevengj
Member

Is this our bug or a bug in Alpine Linux?

@ararslan
Member Author

Unclear and I'm not sure how to figure it out, so I figured I'd open an issue here to see if anyone had any ideas.

@stevengj
Member

stevengj commented Apr 19, 2018

What character are they disagreeing on, and what answer is utf8proc giving vs towupper? We probably should modify that error message to print a bit more info.
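As a rough illustration of what a more informative message could look like, the sketch below prints the input code point together with both results. `format_case_mismatch` is a hypothetical helper and the message layout is an assumption; neither is taken from utf8proc's test suite.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of utf8proc): describe a disagreement
 * between utf8proc's uppercase mapping and the system's towupper for
 * one input code point. Returns whatever snprintf returns. */
int format_case_mismatch(char *buf, size_t size, uint32_t c,
                         uint32_t ours, uint32_t sys)
{
    return snprintf(buf, size,
                    "MISMATCH for U+%04X: utf8proc_toupper gives U+%04X, "
                    "towupper gives U+%04X",
                    (unsigned)c, (unsigned)ours, (unsigned)sys);
}
```

A message in this shape names the input and attributes each result to its source, which the current `df != towupper(df) == 1e9e` form leaves to the reader.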

@ararslan
Member Author

The error message is saying df != towupper(df) == 1e9e, so plugging that into Julia:

julia> Char(0xdf)
'ß': Unicode U+00df (category Ll: Letter, lowercase)

julia> Char(0x1e9e)
'ẞ': Unicode U+1e9e (category Lu: Letter, uppercase)

@stevengj
Member

I'm confused, it looks like the error message is saying that utf8proc_toupper(0x00df) is giving 0x00df, which is not what utf8proc gives on my system (or yours, since Julia calls it).

@ararslan
Member Author

Sure enough:

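#include <stdio.h>
#include <inttypes.h>
#include <wctype.h>
#include "utf8proc.h"

int main(void) {
    /* Uppercase U+00DF (ß) with utf8proc and with the system libc. */
    int32_t u = utf8proc_toupper(0x00df);
    wint_t w = towupper(0x00df);
    /* %x expects unsigned int, so cast both results explicitly. */
    printf("utf8proc_toupper: %x\ntowupper: %x\n", (unsigned)u, (unsigned)w);
    return 0;
}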

Output on Ubuntu:

utf8proc_toupper: df
towupper: df

Output on Alpine:

utf8proc_toupper: df
towupper: 1e9e

And yet,

julia> uppercase(Char(0xdf))
'ß': Unicode U+00df (category Ll: Letter, lowercase)

on both systems.

@ararslan
Member Author

So I guess it's Alpine's fault for returning a bogus value from towupper.

@ararslan
Member Author

cc @jirutka, maintainer of the utf8proc Alpine port, and @fabled, who maintains the musl-dev port (which contains, among other things, wctype.h).

@ararslan
Member Author

Since this isn't a utf8proc issue, I'll close this. Thanks for humoring me, Steven. 😉

@stevengj
Member

stevengj commented Apr 19, 2018

Unless the uppercase mapping of U+00df changed in Unicode 10 (since utf8proc currently uses the Unicode 9 tables)?

This U+00df page says that U+1E9E is a "nonstandard uppercase," though.

@fabled

fabled commented Apr 20, 2018

cc @richfelker

@richfelker

Regarding case mappings, this is intentional, not a bug:

https://git.musl-libc.org/cgit/musl/commit/src/ctype/towctrans.c?id=4674809bdf7a46041ac0152eea0a6363ceeca548

For wcwidth, I'd have to see what the mismatches are.

@stevengj
Member

I agree that there is an argument for supporting the nonstandard uppercase form here.

@richfelker

Whether "ẞ" or "SS" is preferred is subject to cultural considerations, but the C locale system cannot represent the latter mapping. "ß" is obviously not a correct uppercase form for "ß".

It's been a while since I delved into the Unicode stability policy, but my understanding is that they can't (by their own policy) add new case mappings for characters that previously lacked them. Even if that's wrong, they may want to avoid adding a nominal case mapping to a single character when the mapping to the sequence "SS" may be preferred in some cultures. I don't think these considerations detract from mapping to "ẞ" being the right thing to do within the limitations of the C locale framework.
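The limitation mentioned here follows from the shape of `towupper`: it takes one `wint_t` and returns one `wint_t`, so it has no way to return the two-character sequence "SS". A minimal sketch of the difference, using a hypothetical `upper_full` that writes a sequence of code points instead; the name, the signature, and the single hard-coded special case are assumptions for illustration, not an API of musl, glibc, or utf8proc:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-to-many uppercase mapping, for illustration only.
 * Writes the uppercase form of c into out (capacity cap) and returns
 * the number of code points written, or 0 if out is too small.
 * Only U+00DF and ASCII a-z are handled; everything else is copied. */
size_t upper_full(uint32_t c, uint32_t *out, size_t cap)
{
    if (c == 0x00DF) {            /* sharp s: full mapping is "SS" */
        if (cap < 2) return 0;
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    if (cap < 1) return 0;
    out[0] = (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
    return 1;
}
```

An interface that returns a single code point has to pick either "ß" (no change) or "ẞ" for U+00DF; only a sequence-returning interface like the one sketched can express "SS".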

@stevengj
Member

stevengj commented May 2, 2018

Regarding the width mismatches, I wonder if that is due to the treatment of east asian neutral? See #83

@richfelker

That thread (#83) is a mess. If there are people who want to solve the problem of whether certain scripts (or some characters from certain scripts) should be treated as wcwidth=2, there needs to be an organized effort, outside of a single software project like this, involving actual users/experts of the affected scripts, not pulling values out of some random font file (unifont), and there should be interest from key implementors in supporting the outcome before the process begins. Until then, musl (and afaik, also glibc) take a simple approach and assign width=1 to everything except to characters that were explicitly wide in legacy CJK charsets.

@stevengj
Member

stevengj commented May 2, 2018

You assign width=1 to combining characters?

@richfelker

Sorry, I was not sufficiently precise. Of course nonspacing combining characters (Mn) and certain other nonspacing (most of Cf) characters are wcwidth=0, and control characters (nonprintable) are wcwidth=-1.
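The convention described here (-1 for nonprintable, 0 for nonspacing, 2 for wide, 1 otherwise) can be sketched with a toy function. The ranges below are a tiny hand-picked subset chosen for illustration; `toy_wcwidth` is a hypothetical name, and this is not musl's or utf8proc's actual table.

```c
#include <stdint.h>

/* Toy width function for illustration; NOT musl's or utf8proc's table.
 * Only a few well-known ranges are covered:
 *   U+0000                 ->  0 (wcwidth returns 0 for the null character)
 *   C0/C1 controls, DEL    -> -1 (nonprintable)
 *   U+0300..U+036F (Mn)    ->  0 (combining diacritical marks)
 *   U+4E00..U+9FFF         ->  2 (CJK unified ideographs)
 *   anything else          ->  1 */
int toy_wcwidth(uint32_t c)
{
    if (c == 0) return 0;
    if (c < 0x20 || (c >= 0x7F && c <= 0x9F)) return -1;
    if (c >= 0x0300 && c <= 0x036F) return 0;
    if (c >= 0x4E00 && c <= 0x9FFF) return 2;
    return 1;
}
```

The disagreements discussed in this thread are about which code points belong in each bucket, not about the four-way scheme itself.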

@stevengj
Member

stevengj commented May 2, 2018

> except to characters that were explicitly wide in legacy CJK charsets

I assume you mean "East Asian Wide" characters in UAX#11? These aren't just legacy charsets — emoji were changed to wide in Unicode 9 IIRC.

@richfelker

musl's definitions are derived programmatically from EastAsianWidth.txt from Unicode 10.0, and I don't see any emoji marked as wide in it. Aside from actual ideographic characters, the only characters I'm aware of which are marked full/wide are the ones present in legacy charsets.

@stevengj
Member

stevengj commented May 2, 2018

https://www.unicode.org/reports/tr11/tr11-31.html#ED4 says that characters with the property Emoji_Presentation should be treated as East Asian Wide

@richfelker

Well it says they're classified as such, and in fact the ones with Emoji_Presentation are in EastAsianWidth.txt so they should already be marked wcwidth=2 in musl. I don't know why I didn't notice them before. I think that omits a lot of characters I thought of as "emoji" including classic dingbats etc.

@ararslan
Member Author

ararslan commented May 2, 2018

> Until then, musl (and afaik, also glibc) take a simple approach and assign width=1 to everything except to characters that were explicitly wide in legacy CJK charsets.

Indeed, the width mismatches on Alpine stem from Alpine treating a lot of things as wcwidth 1 where we treat them as 0, 1, or 2 (from a quick skim over the sea of output).

I don't believe it's the case with glibc though; I get no mismatches running make check on Ubuntu. Unless I'm misunderstanding what you mean.

@stevengj
Member

stevengj commented May 2, 2018

@ararslan, can you give an example of a character we assign width 0 where musl gives wcwidth > 0?

@ararslan
Member Author

ararslan commented May 2, 2018

The full list of mismatches, all 340,000+ lines, is here: https://gist.github.com/ararslan/c7dfbfb0f9dff42940a394c79be0afe3

Taking the first few entries from there:

julia> Char(0xad)
'\uad': Unicode U+00ad (category Cf: Other, format)

julia> Char(0x378)
'\u378': Unicode U+0378 (category Cn: Other, not assigned)

julia> Char(0x379)
'\u379': Unicode U+0379 (category Cn: Other, not assigned)

julia> Char(0x380)
'\u380': Unicode U+0380 (category Cn: Other, not assigned)

It looks like musl assigns width 1 to unassigned code points.
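For reference, a comparison of this kind boils down to walking a range of code points and counting where two width functions disagree. The sketch below is an assumption about how such a check could be written, not the code behind `make check`; `count_width_mismatches` and the two stand-in policies are hypothetical names.

```c
#include <stdint.h>

typedef int (*width_fn)(uint32_t);

/* Count code points in [lo, hi] on which two width functions disagree. */
unsigned long count_width_mismatches(width_fn a, width_fn b,
                                     uint32_t lo, uint32_t hi)
{
    unsigned long n = 0;
    for (uint32_t c = lo; c <= hi; c++) {
        if (a(c) != b(c)) n++;
        if (c == UINT32_MAX) break;   /* avoid wrapping around */
    }
    return n;
}

/* Stand-in policy 1 (hypothetical): every code point has width 1. */
int width_all_one(uint32_t c) { (void)c; return 1; }

/* Stand-in policy 2 (hypothetical): width 0 for the unassigned
 * code points U+0378 and U+0379, width 1 for everything else. */
int width_gap_zero(uint32_t c)
{
    return (c == 0x0378 || c == 0x0379) ? 0 : 1;
}
```

In a real comparison the two arguments would be thin wrappers around the system `wcwidth` and `utf8proc_charwidth`; the stand-ins only keep the sketch independent of any particular libc.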

@ararslan
Member Author

ararslan commented May 2, 2018

This is an interesting example where musl gives 0 and we give 2:

julia> Char(0x601)
'\u601': Unicode U+0601 (category Cf: Other, format)

@stevengj
Member

stevengj commented May 2, 2018

U+00ad is a soft hyphen, which is an interesting case. In some contexts it is used as a hyphenation hint and is not displayed, and in other cases it is displayed. Many terminal environments do display it, but this is not required.

U+0380 and several of the other characters are unassigned code points. It's not at all obvious what width we should use for these. I suppose, from a probabilistic standpoint, an unassigned code point is more likely to be used for a width-1 character (e.g. a private encoding like Conscript) than a width-0 character. Also, the replacement character U+FFFD has width 1.

U+0601 certainly doesn't look like zero width, and isn't rendered as zero width by any font that I have (e.g. a؁b).

@richfelker

I don't know the situation with U+0601. It's probably a case that needs special-casing. Unicode class Cf is a mess of inconsistency, and while most of them are nonspacing marks or printable formatting controls, apparently some are spacing characters too. U+00AD is already handled specially here (by historical practice it's spacing in charcell terminals, which is what wcwidth is accounting for) and U+0601 probably should be too. I don't know what evidence there is for treating it as wcwidth=2 though. That brings us back to the need for some organized review effort.

@stevengj
Member

stevengj commented May 3, 2018

Note that there already is an "organized review effort, outside of any single software project", sponsored by the GNU project, carried out by both language and typography experts, to identify standards-conformant terminal-compatible font metrics and glyphs for the entirety of Unicode. The result is called GNU Unifont, and it is updated every time Unicode is updated.

The problem is not a lack of data or a lack of review, but rather it is the xkcd standards problem of getting everyone, up and down the stack, to agree on which data to use.

@richfelker

I was not aware of any such aspect to the GNU Unifont project, and wasn't even aware that it's still maintained. If it's really trying to act as a standards process for character-cell metrics, that sounds great, but there seems to be a serious lack of publicity around it. I can't even find anything supporting that claim on their website. Last I looked at it, the glyphs for many scripts were not actually usable for writing in those scripts, and many were double-width just because it turned out to be easier to draw something nice looking in 16x16 than 8x16 for certain characters.

@stevengj
Member

stevengj commented May 3, 2018

Their release notes make it pretty clear that Unifont is actively maintained (and is promptly updated every time Unicode is updated); I'm not sure why you would think otherwise. It is targeted especially at low-resolution displays, it's true, but that is precisely the situation most appropriate for terminals (which often use the minimum readable font size).

As for trying to act as a "standards process", now you're raising the bar. That requires buy-in from libc maintainers, who currently seem to be rejecting out-of-hand any attempt to go beyond EastAsianWidth.txt for glyph metrics.

@richfelker

The above comment that started this:

#127 (comment)

was specifically about acting as a "standards process" with buy-in from implementors (wcwidth implementors being libc, also terminals, screen/tmux, etc.). I am not rejecting attempts to define a better wcwidth out-of-hand. I'm rejecting attempts to claim that unilateral decisions by a party with almost no stake and almost no input from users of the affected scripts should be a basis for our decisions.

@stevengj
Member

stevengj commented May 3, 2018

> a party with almost no stake and almost no input from users of the affected scripts

I'm not sure this is an accurate description of the Unifont developers.

Anyway, though we don't use it as the sole basis of our decisions, it seems like a more reasonable starting point than "all characters have width 1", which employs no input at all from users of the affected scripts.

Reasonable people can disagree about this, of course, but I don't think it's completely crazy for us to incorporate data from Unifont in determining charwidth for cases that are ambiguous in the Unicode standard.

@stevengj
Member

stevengj commented May 3, 2018

By the way, regarding emoji being width 2, the lack of recognition of this prior to Unicode 9 led to this amusing issue in Julia that directly led to our attempt to get better charwidth tables: JuliaLang/julia#3721
