Case conversion test fails on Alpine Linux #127

ararslan · 2018-04-18T23:57:30Z

Output from make check:

test/case
MISMATCH df != towupper(df) == 1e9e
line 0: utf8proc case conversion FAILED 1 tests.
make: *** [Makefile:148: check] Error 1

There are also an extraordinary number of mismatches with the system wcwidth, though the width tests still pass.

stevengj · 2018-04-19T15:11:33Z

Is this our bug or a bug in Alpine Linux?

ararslan · 2018-04-19T15:13:19Z

Unclear and I'm not sure how to figure it out, so I figured I'd open an issue here to see if anyone had any ideas.

stevengj · 2018-04-19T15:19:08Z

What character are they disagreeing on, and what answer is utf8proc giving vs towupper? We probably should modify that error message to print a bit more info.

ararslan · 2018-04-19T15:30:46Z

The error message is saying df != towupper(df) == 1e9e, so plugging that into Julia:

julia> Char(0xdf)
'ß': Unicode U+00df (category Ll: Letter, lowercase)

julia> Char(0x1e9e)
'ẞ': Unicode U+1e9e (category Lu: Letter, uppercase)

stevengj · 2018-04-19T19:46:16Z

I'm confused: it looks like the error message is saying that utf8proc_toupper(0x00df) gives 0x00df, which is not what utf8proc gives on my system (or yours, since Julia calls it).

ararslan · 2018-04-19T20:29:27Z

Sure enough:

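#include <stdio.h>
#include <inttypes.h>
#include <wctype.h>
#include "utf8proc.h"

/* Compare utf8proc's uppercase mapping of U+00DF with the C library's. */
int main(void) {
    int32_t u = utf8proc_toupper(0x00df);
    wint_t w = towupper(0x00df);
    /* %x expects an unsigned int, so cast both results explicitly. */
    printf("utf8proc_toupper: %x\ntowupper: %x\n", (unsigned)u, (unsigned)w);
    return 0;
}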

Output on Ubuntu:

utf8proc_toupper: df
towupper: df

Output on Alpine:

utf8proc_toupper: df
towupper: 1e9e

And yet,

julia> uppercase(Char(0xdf))
'ß': Unicode U+00df (category Ll: Letter, lowercase)

on both systems.

ararslan · 2018-04-19T20:34:38Z

So I guess it's Alpine's fault for returning a bogus value from towupper.

ararslan · 2018-04-19T20:40:42Z

cc @jirutka, maintainer of the utf8proc Alpine port, and @fabled, who maintains the musl-dev port (which contains, among other things, wctype.h).

ararslan · 2018-04-19T20:44:15Z

Since this isn't a utf8proc issue, I'll close this. Thanks for humoring me, Steven. 😉

stevengj · 2018-04-19T23:17:26Z

Unless the uppercase mapping of U+00df changed in Unicode 10 (since utf8proc currently uses the Unicode 9 tables)?

This U+00df page says that U+1E9E is a "nonstandard uppercase," though.

fabled · 2018-04-20T05:27:01Z

cc @richfelker

richfelker · 2018-04-20T13:45:35Z

Regarding case mappings, this is intentional, not a bug:

https://git.musl-libc.org/cgit/musl/commit/src/ctype/towctrans.c?id=4674809bdf7a46041ac0152eea0a6363ceeca548

For wcwidth, I'd have to see what the mismatches are.

stevengj · 2018-04-20T14:37:08Z

I agree that there is an argument for supporting the nonstandard uppercase form here.

richfelker · 2018-04-20T14:52:15Z

Whether "ẞ" or "SS" is preferred is subject to cultural considerations, but the C locale system cannot represent the latter mapping. "ß" is obviously not a correct uppercase form for "ß".

It's been a while since I delved into the Unicode stability policy, but my understanding is that they can't (by their own policy) add new case mappings for characters that previously lacked them; even if that's wrong, they may want to avoid adding a nominal case mapping to a single character when the mapping to a sequence "SS" may be preferred in some cultures. I don't think these considerations detract from mapping to "ẞ" being the right thing to do within the limitations of the C locale framework.

stevengj · 2018-05-02T13:26:05Z

Regarding the width mismatches, I wonder if that is due to the treatment of east asian neutral? See #83

richfelker · 2018-05-02T14:27:07Z

That thread (#83) is a mess. If there are people who want to solve the problem of whether certain scripts (or some characters from certain scripts) should be treated as wcwidth=2, there needs to be an organized effort, outside of a single software project like this, involving actual users/experts of the affected scripts, not pulling values out of some random font file (unifont), and there should be interest from key implementors in supporting the outcome before the process begins. Until then, musl (and afaik, also glibc) take a simple approach and assign width=1 to everything except to characters that were explicitly wide in legacy CJK charsets.

stevengj · 2018-05-02T14:30:55Z

You assign width=1 to combining characters?

richfelker · 2018-05-02T15:09:51Z

Sorry, I was not sufficiently precise. Of course nonspacing combining characters (Mn) and certain other nonspacing (most of Cf) characters are wcwidth=0, and control characters (nonprintable) are wcwidth=-1.

stevengj · 2018-05-02T16:02:42Z

> except to characters that were explicitly wide in legacy CJK charsets

I assume you mean "East Asian Wide" characters in UAX#11? These aren't limited to legacy charsets: emoji were changed to wide in Unicode 9, IIRC.

richfelker · 2018-05-02T16:13:30Z

musl's definitions are derived programmatically from EastAsianWidth.txt from Unicode 10.0, and I don't see any emoji marked as wide in it. Aside from actual ideographic characters, the only characters I'm aware of which are marked full/wide are the ones present in legacy charsets.

stevengj · 2018-05-02T16:48:42Z

https://www.unicode.org/reports/tr11/tr11-31.html#ED4 says that characters with the property Emoji_Presentation should be treated as East Asian Wide

richfelker · 2018-05-02T17:02:23Z

Well it says they're classified as such, and in fact the ones with Emoji_Presentation are in EastAsianWidth.txt so they should already be marked wcwidth=2 in musl. I don't know why I didn't notice them before. I think that omits a lot of characters I thought of as "emoji" including classic dingbats etc.

ararslan · 2018-05-02T18:15:47Z

> Until then, musl (and afaik, also glibc) take a simple approach and assign width=1 to everything except to characters that were explicitly wide in legacy CJK charsets.

Indeed, the width mismatches on Alpine stem from Alpine treating a lot of things as wcwidth 1 where we treat them as 0, 1, or 2 (from a quick skim over the sea of output).

I don't believe it's the case with glibc though; I get no mismatches running make check on Ubuntu. Unless I'm misunderstanding what you mean.

stevengj · 2018-05-02T18:21:42Z

@ararslan, can you give an example of a character we assign width 0 where musl gives wcwidth > 0?

ararslan · 2018-05-02T18:38:52Z

The full list of mismatches, all 340,000+ lines, is here: https://gist.github.com/ararslan/c7dfbfb0f9dff42940a394c79be0afe3

Taking the first few entries from there:

julia> Char(0xad)
'\uad': Unicode U+00ad (category Cf: Other, format)

julia> Char(0x378)
'\u378': Unicode U+0378 (category Cn: Other, not assigned)

julia> Char(0x379)
'\u379': Unicode U+0379 (category Cn: Other, not assigned)

julia> Char(0x380)
'\u380': Unicode U+0380 (category Cn: Other, not assigned)

It looks like musl assigns width 1 to unassigned code points.

ararslan · 2018-05-02T18:44:28Z

This is an interesting example where musl gives 0 and we give 2:

julia> Char(0x601)
'\u601': Unicode U+0601 (category Cf: Other, format)

stevengj · 2018-05-02T18:59:48Z

U+00ad is a soft hyphen, which is an interesting case. In some contexts it is used as a hyphenation hint and is not displayed, and in other cases it is displayed. Many terminal environments do display it, but this is not required.

U+0380 and several of the other characters are unassigned code points. It's not at all obvious what width we should use for these. I suppose, from a probabilistic standpoint, an unassigned code point is more likely to be used for a width-1 character (e.g. a private encoding like Conscript) than for a width-0 character. Also, the replacement character U+FFFD has width 1.

U+0601 certainly doesn't look like zero width, and isn't rendered as zero width by any font that I have (e.g. a؁b).

richfelker · 2018-05-02T19:20:36Z

I don't know the situation with U+0601. It probably needs special-casing. Unicode class Cf is a mess of inconsistency, and while most of them are nonspacing marks or printable formatting controls, apparently some are spacing characters too. U+00AD is already handled specially here (by historical practice it's spacing in charcell terminals, which is what wcwidth is accounting for) and U+0601 probably should be too. I don't know what evidence there is for treating it as wcwidth=2 though. That brings us back to the need for some organized review effort.

stevengj · 2018-05-03T15:34:03Z

Note that there already is an "organized review effort, outside of any single software project", sponsored by the GNU project, carried out by both language and typography experts, to identify standards-conformant terminal-compatible font metrics and glyphs for the entirety of Unicode. The result is called GNU Unifont, and it is updated every time Unicode is updated.

The problem is not a lack of data or a lack of review, but rather it is the xkcd standards problem of getting everyone, up and down the stack, to agree on which data to use.

richfelker · 2018-05-03T15:50:58Z

I was not aware of any such aspect to the GNU Unifont project, and wasn't even aware that it's still maintained. If it's really trying to act as a standards process for character-cell metrics, that sounds great, but there seems to be a serious lack of publicity around it. I can't even find anything supporting that claim on their website. Last I looked at it, the glyphs for many scripts were not actually usable for writing in those scripts, and many were double-width just because it turned out to be easier to draw something nice-looking in 16x16 than in 8x16 for certain characters.

stevengj · 2018-05-03T16:19:58Z

Their release notes make it pretty clear that Unifont is actively maintained (and is promptly updated every time Unicode is updated); I'm not sure why you would think otherwise. It is targeted especially at low-resolution displays, it's true, but that is precisely the situation most appropriate for terminals (which often use the minimum readable font size).

As for trying to act as a "standards process", now you're raising the bar. That requires buy-in from libc maintainers, who currently seem to be rejecting out-of-hand any attempt to go beyond EastAsianWidth.txt for glyph metrics.

richfelker · 2018-05-03T17:10:59Z

The above comment that started this (#127 (comment)) was specifically about acting as a "standards process" with buy-in from implementors (wcwidth implementors being libc, also terminals, screen/tmux, etc.). I am not rejecting attempts to define a better wcwidth out-of-hand. I'm rejecting attempts to claim that unilateral decisions by a party with almost no stake and almost no input from users of the affected scripts should be a basis for our decisions.

stevengj · 2018-05-03T17:41:59Z

> a party with almost no stake and almost no input from users of the affected scripts

I'm not sure this is an accurate description of the Unifont developers.

Anyway, though we don't use it as the sole basis of our decisions, it seems like a more reasonable starting point than "all characters have width 1", which employs no input at all from users of the affected scripts.

Reasonable people can disagree about this, of course, but I don't think it's completely crazy for us to incorporate data from Unifont in determining charwidth for cases that are ambiguous in the Unicode standard.

stevengj · 2018-05-03T21:24:56Z

By the way, regarding emoji being width 2, the lack of recognition of this prior to Unicode 9 led to this amusing issue in Julia that directly led to our attempt to get better charwidth tables: JuliaLang/julia#3721

ararslan closed this as completed on Apr 19, 2018

stevengj mentioned this issue on Apr 23, 2018: use uppercase mapping ß (U+00df) to ẞ (U+1E9E) #130 (Closed)

stevengj mentioned this issue on May 3, 2018: charwidth=1 for soft hyphen and unassigned codepoints #135 (Merged)
