set east asian neutral width to 1 #83

joshuarubin · 2016-08-30T08:03:25Z

According to http://unicode.org/reports/tr11/#Recommendations, east asian neutral width characters should always map to either halfwidth or regular (narrow) characters.

Runes like U+0CA8 (ನ), U+0CB5 (ವ), U+0CB9 (ಹ), U+0C97 (ಗ), U+0CA6 (ದ), U+0CB0 (ರ) are being reported as width 2, when they should be width 1.

This patch attempts to fix the issue, but many tests fail as a result (e.g. non-printing dfff had width 1).

I am happy to help resolve the broken tests, but am not sure of the best approach. Please advise.

stevengj · 2016-08-30T12:07:59Z

cc @jiahao … I thought we discussed the neutral-width case at some point?

joshuarubin · 2016-08-30T15:43:12Z

I made a small change so that it only forces the width to 1 if it already had a width. As a result, all tests pass again, and the characters I had issues with do now return width 1 as expected.
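The rule described above can be sketched outside the table-generation script. The C function below is a minimal illustration, not the code in data/charwidths.jl: the name `resolve_width`, the convention that `font_width <= 0` means "no printable width recorded", and the W/F branch are all assumptions made for the example.

```c
#include <string.h>

/* Hypothetical helper (not part of utf8proc): combine an East_Asian_Width
 * class with a font-derived width. font_width <= 0 is assumed to mean the
 * font data recorded no printable width for the character. */
int resolve_width(const char *eaw, int font_width)
{
    if (strcmp(eaw, "W") == 0 || strcmp(eaw, "F") == 0)
        return 2;               /* wide or fullwidth */
    if (strcmp(eaw, "Na") == 0 || strcmp(eaw, "H") == 0)
        return 1;               /* narrow or halfwidth */
    if (strcmp(eaw, "N") == 0 && font_width > 0)
        return 1;               /* neutral: force to 1 only if it already had a width */
    return font_width;          /* everything else keeps the font-derived value */
}
```

With this rule, a neutral character that the font data reports as width 2 comes out as 1, while a neutral character with no recorded width stays at 0.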

stevengj · 2016-08-30T16:07:38Z

@joshuarubin, I don't see where UAX#11 says that neutral characters have width 1. (It only says they map to halfwidth for legacy encodings. For rendering, it says "An implementation might therefore elect to treat them as ambiguous even though they are classified as neutral here.")

In #27, we elected to use the Unifont width for "neutral" characters, since this seems to be font dependent. Apparently, the characters you mention are width 2 in Unifont?

joshuarubin · 2016-08-30T16:14:30Z

> Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.

Yes, it seems that Unifont treats them as width 2. However, this decision is causing severe rendering problems on my mac where Terminal.app and iTerm2.app display them as width 1.

If there was some way for me to detect these characters externally and correct them, that would be sufficient too, but as far as I can tell there is nothing to distinguish them.

stevengj · 2016-08-30T16:39:02Z

Is Unifont in the clear minority among fonts here?

stevengj · 2016-08-30T16:40:02Z

See also the discussion in JuliaLang/julia#3721 .... if the font doesn't match what the terminal thinks the charwidth is, you are going to see problems regardless of what width we return.

stevengj · 2016-08-30T16:40:56Z

data/charwidths.jl

-        elseif width=="Na"|| width=="H" # narrow or half
+        elseif width=="Na" || width=="H" # narrow or half
+            CharWidths[c]=1
+        elseif width=="N" && haskey(CharWidths, c)


Maybe && get(CharWidths, c, 0) > 0

Yeah, I had that, but it didn't actually make a difference to the output. I can add it back, if you'd like.

joshuarubin · 2016-08-30T16:57:26Z

> Is Unifont in the clear minority among fonts here?

A good question, I don't know how to answer that though...

jiahao · 2016-08-30T18:24:05Z

As @stevengj writes, Unicode explicitly avoids making recommendations for character widths. See JuliaLang/julia#3721 (comment) for a discussion, specifically the quote from UAX 11:

The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.

Most people unfortunately skip over this very important wording in the beginning of UAX 11.

> Runes like U+0CA8 (ನ), U+0CB5 (ವ), U+0CB9 (ಹ), U+0C97 (ಗ), U+0CA6 (ದ), U+0CB0 (ರ) are being reported as width 2, when they should be width 1.

Could you explain why you say they should be width 1?
Note that the reference glyphs for the Kannada characters (not runes) in question (PDF) look very clearly like they are meant to be fullwidth. To look at just two:

Nevertheless, the reference glyphs are not meant to be authoritative in specifying character widths and individual fonts can do whatever they want. That's the main problem - it's not going to be possible to solve this problem in general for all fonts.

stevengj · 2016-08-30T18:33:36Z

(@jiahao, "rune" is Go terminology for a codepoint.)

joshuarubin · 2016-08-30T18:39:02Z

> Could you explain why you say they should be width 1?

Frankly, I agree that U+0CA6 looks like it should be double width. The problem I have is that my terminals (iTerm2.app nightly [with Unicode 9 char widths disabled] and Terminal.app using a variety of fonts from Monaco to Hasklig) render it as single width.

It also renders symbols (flags, emoji) as single width despite clearly overflowing into the next column.

It is easy to override the output of utf8proc_charwidth for UTF8PROC_CATEGORY_SO, but I have no way of knowing if it is a neutral width character that should have width 1 instead of 2.

If I had a way to identify the neutral width characters in utf8proc, I could just handle it on my own.
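A caller-side workaround along these lines might look like the following sketch. It is only an illustration under stated assumptions: `is_neutral_width` and `adjusted_width` are hypothetical names, the range table is a one-entry excerpt covering the Kannada block (a real table would be generated from EastAsianWidth.txt), and `lib_width` stands for whatever the width library reported.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative excerpt only: one range whose code points are
 * East_Asian_Width=N. Not a complete table. */
struct cp_range { int32_t first, last; };

static const struct cp_range neutral_ranges[] = {
    { 0x0C80, 0x0CFF },   /* Kannada block */
};

/* Hypothetical predicate: is cp listed in the neutral excerpt above? */
int is_neutral_width(int32_t cp)
{
    size_t n = sizeof neutral_ranges / sizeof neutral_ranges[0];
    for (size_t i = 0; i < n; i++)
        if (cp >= neutral_ranges[i].first && cp <= neutral_ranges[i].last)
            return 1;
    return 0;
}

/* Clamp a library-reported width to 1 for neutral characters;
 * widths of 0 or 1 and non-neutral characters pass through unchanged. */
int adjusted_width(int32_t cp, int lib_width)
{
    return (is_neutral_width(cp) && lib_width > 1) ? 1 : lib_width;
}
```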

joshuarubin · 2016-09-07T16:13:55Z

Anything I can do to help move this closer to a resolution?

stevengj · 2016-09-07T16:46:56Z

I'm not sure if there is any satisfactory resolution here. No matter what we do, there will always be buggy terminals that rely on out-of-date operating-system wcwidth functions, and fonts that disagree on character widths.

stevengj · 2016-09-07T16:47:53Z

For example, IIRC the Unicode 9 standard reclassified emoji as fullwidth, but terminals haven't caught up.

stevengj · 2016-09-07T16:50:03Z

If you want to match the OS's (probably out-of-date) wcwidth on characters where it is defined, you could do:

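#define _XOPEN_SOURCE 700   /* expose wcwidth() on glibc */
#include <wchar.h>          /* wcwidth */
#include <utf8proc.h>       /* utf8proc_charwidth */

/* Use the OS width when it is positive, else fall back to utf8proc (assumes wchar_t holds code points). */
int mywidth(wchar_t c)
{
    int w = wcwidth(c);
    return w > 0 ? w : utf8proc_charwidth(c);
}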

stevengj · 2016-09-07T16:54:47Z

@jiahao, since neither the fonts nor UAX11 are authoritative here, maybe we should err on the side of the "informative" UAX11 suggestions rather than Unifont, on the theory that UAX11 seems more likely to match what terminals do?

Keno · 2016-09-07T16:56:01Z

iTerm has unicode standard switching by proprietary escape code now.

joshuarubin · 2016-09-07T17:20:36Z

> iTerm has unicode standard switching by proprietary escape code now.

I can't find that documented, do you have a link?

Keno · 2016-09-07T20:02:30Z

https://gitlab.com/gnachman/iterm2/wikis/unicodeversionswitching

jiahao · 2016-09-26T18:33:22Z

> maybe we should err on the side of the "informative" UAX11 suggestions rather than Unifont, on the theory that UAX11 seems more likely to match what terminals do?

My concern here is that most terminals simply provide character widths that don't provide for a reasonable display of the characters in question. Even U+0ca6 above has a reference glyph that would suggest to me a fullwidth rather than halfwidth character. Since characters like these don't exist in the legacy East Asian character sets, I would think that the "default to narrow/halfwidth" behavior is a result more of neglect than an actual attempt to display these characters correctly in fixed width.

joshuarubin · 2016-11-18T20:44:10Z

Hi, this is still a problem. After considering this further, I think it may be more palatable, rather than merging this change, to add a new function that simply returns whether a character's width is ambiguous. It should return true for characters in the private use category and for the relevant east asian characters, otherwise false.
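As a rough sketch of what such a predicate could look like (the name `is_ambiguous_width` is hypothetical and no such function is confirmed to exist in utf8proc), the example below checks a code point against the three private use ranges plus a two-entry excerpt of East_Asian_Width=A characters. A real implementation would generate the full table from EastAsianWidth.txt.

```c
#include <stddef.h>
#include <stdint.h>

struct cp_range { int32_t first, last; };

/* Private use ranges, plus a small illustrative excerpt of
 * East_Asian_Width=A characters. Not a complete table. */
static const struct cp_range ambiguous_ranges[] = {
    { 0x00A1,   0x00A1   },   /* inverted exclamation mark (A) */
    { 0x2500,   0x254B   },   /* box drawing (A) */
    { 0xE000,   0xF8FF   },   /* private use area */
    { 0xF0000,  0xFFFFD  },   /* supplementary private use area A */
    { 0x100000, 0x10FFFD },   /* supplementary private use area B */
};

/* Hypothetical predicate: returns 1 if the width of cp is ambiguous
 * according to the excerpt above, otherwise 0. */
int is_ambiguous_width(int32_t cp)
{
    size_t n = sizeof ambiguous_ranges / sizeof ambiguous_ranges[0];
    for (size_t i = 0; i < n; i++)
        if (cp >= ambiguous_ranges[i].first && cp <= ambiguous_ranges[i].last)
            return 1;
    return 0;
}
```

Note that this flags ambiguous (A) and private use characters only; neutral (N) characters such as U+0CA6 are not covered by it.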

stevengj · 2016-11-18T20:48:05Z

@joshuarubin, that seems reasonable.

joshuarubin · 2016-11-18T22:54:27Z

Well, after taking a look at what this would require, I certainly think it would be useful to have a way to know if characters have ambiguous width.

However, that would still not help the situation that this issue describes.

I would like to reiterate my support for overriding unifont in the case where the unicode standard says a character has neutral width but unifont renders it larger than 1 column.

The primary reason for this is that other systems implementing unicode character widths (e.g. terminals) will adhere to the spec rather than defer to any particular font choice.

Supporting documentation:

http://unicode.org/reports/tr11/#ED7

Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters [for East Asian character sets], but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na)

http://unicode.org/reports/tr11/#ED5

East Asian Narrow (Na): All other characters that are always narrow

iTerm2 character width tables:

https://github.com/gnachman/iTerm2/blob/master/sources/NSCharacterSet%2BiTerm.m

Keno · 2016-11-18T22:59:56Z

The fundamental problem here is that everyone along the stack needs to agree on the character width tables. The problem is that there is no agreement on what those are, which is why iTerm added escape codes to switch the tables. It would have been nice not to be in this mess in the first place, but since we are, I feel like iTerm's approach is the sanest choice. Whatever you do, you're bound to break something.

joshuarubin · 2016-11-18T23:25:26Z

I was under the impression utf8proc tried to implement unicode 9 only? I'm not asking for support for other versions.

stevengj · 2016-11-21T14:26:42Z

If your objective is "look good on terminals", I think @Keno's point is it's still going to break because most terminal software doesn't use up-to-date widths, unless you have something like iTerm.

joshuarubin · 2016-11-21T16:43:55Z

The objective is to support unicode 9 terminals including iTerm2 and rxvt. As more terminals support these widths, software will need to use them as well. There's a chicken and egg problem here. I'm trying to find ways to get support working across the board in projects like terminals, tmux, vim/neovim, etc. I've now gone and created a new project to help this endeavor, wcwidth9. While I really like utf8proc for the simplicity it offers in many unicode related tasks, I simply can't suggest it be used for calculating character widths.

Keno · 2016-11-21T17:20:30Z

The problem is that though e.g. iTerm2 treats characters like ದ as width 1, they are rendered as width 2. Now that iTerm2 supports character table switching, I would actually like to add a mode to it that has its width tables match its rendering. Unicode 9 is better in that regard, because it fixed the official definition for emoji widths, but it's still not perfect with respect to the rendering. The wcwidth function implemented in this library represents our best understanding how wide things will be when rendered on the screen, which is its objective. It does not implement "How wide will my terminal think this character is", which is important, but a different question, and in general impossible to answer, because you'd have to tell it what the terminal's width tables are. What I would be sympathetic to is adding an option that allows you to select which width table to use. Then applications that care enough can detect their terminals and pick an appropriate width table (hell we could even provide such a function in utf8proc). The problem with choosing a standard one is that somebody will be unhappy, no matter what the choice.
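To make the "select which width table to use" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the enum, the function name `charwidth_with_table`, and the data, which covers only the six Kannada letters named in this thread and returns 1 for every other code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector; not an existing utf8proc option. */
enum width_table {
    WIDTH_TABLE_RENDERED,   /* best guess at the rendered glyph width */
    WIDTH_TABLE_UAX11       /* East_Asian_Width, neutral treated as narrow */
};

/* The six Kannada letters reported as width 2 in this thread. */
static const int32_t wide_when_rendered[] = {
    0x0C97, 0x0CA6, 0x0CA8, 0x0CB0, 0x0CB5, 0x0CB9
};

/* Toy lookup: the listed letters are 2 columns under the rendered table
 * and 1 column under the UAX #11 table; everything else is 1 for brevity. */
int charwidth_with_table(int32_t cp, enum width_table table)
{
    size_t n = sizeof wide_when_rendered / sizeof wide_when_rendered[0];
    if (table == WIDTH_TABLE_RENDERED)
        for (size_t i = 0; i < n; i++)
            if (wide_when_rendered[i] == cp)
                return 2;
    return 1;
}
```

An application that has detected its terminal would pass the table matching that terminal, so the same code point can legitimately yield different widths.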

I do appreciate you working on this though. It's an important and non-trivial problem.

Also cc @gnachman who may be interested in this discussion.

stevengj · 2016-11-21T17:23:39Z

Isn't the operating system's (probably out-of-date) wcwidth function (when it returns a nonzero value) the most likely answer to "How wide will my terminal think this character is?"

Keno · 2016-11-21T17:28:24Z

Depends on whether the terminal emulator asks the os's (or rather libc's) wcwidth or has its own character tables. iTerm2 has its own certainly. I suspect Terminal.app is using the libc one. Not sure about rxvt (@joshuarubin?). Many other terminals don't handle double wide characters at all.

joshuarubin · 2016-11-21T17:34:13Z

> Now that iTerm2 supports character table switching, I would actually like to add a mode to it that has its width tables match its rendering.

That's a very interesting idea, but it will also add to the confusion. I think some confusion while everything transitions to unicode9 is unavoidable at this point, but that the spec should be the guideline, not the font. If the terminal thinks it's width 1 and the software thinks it's width 1 but the font renders it as width 2, then that character will overflow into the next column on the display, but will not cause cascading rendering problems.
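The difference between a local overflow and a cascading problem can be shown by comparing cursor columns. The sketch below is illustrative only: the two width functions are made-up stand-ins for "what the application assumes" and "what the terminal assumes", and `column_drift` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*width_fn)(int32_t cp);

/* Example width functions (assumptions for illustration only). */
int width_all_narrow(int32_t cp) { (void)cp; return 1; }
int width_kannada_wide(int32_t cp)
{
    return (cp >= 0x0C80 && cp <= 0x0CFF) ? 2 : 1;
}

/* Column the cursor ends in after n code points under width function w. */
int end_column(const int32_t *cps, size_t n, width_fn w)
{
    int col = 0;
    for (size_t i = 0; i < n; i++)
        col += w(cps[i]);
    return col;
}

/* Difference between where the application believes the cursor is and
 * where the terminal actually places it. */
int column_drift(const int32_t *cps, size_t n, width_fn app, width_fn term)
{
    return end_column(cps, n, app) - end_column(cps, n, term);
}
```

When both sides use the same widths the drift is zero, whatever the font draws. When they disagree, the drift grows with every affected character, which is what shifts everything after it on the line.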

> The wcwidth function implemented in this library represents our best understanding how wide things will be when rendered on the screen, which is its objective.

I agree that is a very useful datapoint, it's just not what I need now when implementing interfaces.

> What I would be sympathetic to is adding an option that allows you to select which width table to use.

Again, I'll take that over nothing, but fear it might add long term confusion. As it is, I already need to know if a width is ambiguous (according to east asian context). I also need to know if a character is in the private use area which is ambiguous, but should not vary by east asian context. What's one more option...

> Isn't the operating system's (probably out-of-date) wcwidth function (when it returns a nonzero value) the most likely answer to "How wide will my terminal think this character is?"

wcwidth on (now) MacOS is horribly broken. What's more, it has no way of knowing if the context is unicode9 or earlier. Even further, it is incapable (except perhaps via locale, and maybe that is the correct way to handle it) of determining the east asian context (or the width that private use area chars should be).

gnachman · 2016-11-21T20:57:31Z

The character width situation is quite a mess, isn't it? Fonts render glyphs at different sizes so forget about trying to lay out based on the visible size of the glyph. The East Asian width is your best bet even though it's imperfect. I'm happy to help improve the situation for iTerm2 users but there's not much left I can do. I could help you determine the availability of the proprietary table switching escape code at runtime.

stevengj · 2016-11-21T21:35:52Z

Of course wcwidth is broken. But that's probably the function that most terminal software will use to determine character widths, no?

joshuarubin · 2016-11-22T07:12:36Z

> Of course wcwidth is broken. But that's probably the function that most terminal software will use to determine character widths, no?

The codebases I've looked at, including iTerm2, zsh, neovim and tmux have all replaced (either optionally or automatically through build-time platform identification or build-time tests) wcwidth in some form or another.

jakwings · 2018-04-01T06:01:33Z

I think the box drawing characters should have width 1 by default, which a lot of CLI programs assume. Many CJK fonts set their widths to 2, but I think this can be resolved by introducing new terminal settings.

set east asian neutral width to 1

9131844

joshuarubin mentioned this pull request Aug 30, 2016

use utf8proc on platforms with broken wcwidth tmux/tmux#524

Closed

only force neutral width to 1 if it had a width to begin with

d2d8969

stevengj reviewed Aug 30, 2016

joshuarubin mentioned this pull request Oct 10, 2016

tmux: enable utf8proc Homebrew/homebrew-core#5665

Closed


stevengj mentioned this pull request Feb 27, 2017

Merge charwidth and strwidth? JuliaLang/julia#20816

Closed

stevengj mentioned this pull request Sep 22, 2017

charwidth of U+2a1d (join) #114

Closed

stevengj mentioned this pull request May 2, 2018

Case conversion test fails on Alpine Linux #127

Closed

kghost mentioned this pull request Oct 15, 2018

Give API to measure the space that a string occupies microsoft/terminal#218

Open

stevengj mentioned this pull request Mar 30, 2019

give up on Unifont for charwidth data #150

Merged

stevengj closed this in #150 Mar 30, 2019

joshuarubin mentioned this pull request Nov 26, 2019

ambiguous widths joshuarubin/wcwidth9#3

Open
