set east asian neutral width to 1 #83

Closed
wants to merge 2 commits into from


joshuarubin

According to http://unicode.org/reports/tr11/#Recommendations, east asian neutral width characters should always map to either halfwidth or regular (narrow) characters.

Runes like U+0CA8 (ನ), U+0CB5 (ವ), U+0CB9 (ಹ), U+0C97 (ಗ), U+0CA6 (ದ), U+0CB0 (ರ) are being reported as width 2, when they should be width 1.

This patch attempts to fix the issue, but many tests fail as a result (e.g. non-printing dfff had width 1).

I am happy to help resolve the broken tests, but am not sure of the best approach. Please advise.

@stevengj
Member

cc @jiahao … I thought we discussed the neutral-width case at some point?

@joshuarubin
Author

I made a small change so that it only forces the width to 1 if it already had a width. As a result, all tests pass again, and the characters I had issues with do now return width 1 as expected.

@stevengj
Member

stevengj commented Aug 30, 2016

@joshuarubin, I don't see where UAX#11 says that neutral characters have width 1. (It only says they map to halfwidth for legacy encodings. For rendering, it says "An implementation might therefore elect to treat them as ambiguous even though they are classified as neutral here.")

In #27, we elected to use the Unifont width for "neutral" characters, since this seems to be font dependent. Apparently, the characters you mention are width 2 in Unifont?

@joshuarubin
Author

> Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.

Yes, it seems that Unifont treats them as width 2. However, this decision is causing severe rendering problems on my mac where Terminal.app and iTerm2.app display them as width 1.

If there was some way for me to detect these characters externally and correct them, that would be sufficient too, but as far as I can tell there is nothing to distinguish them.

@stevengj
Member

Is Unifont in the clear minority among fonts here?

@stevengj
Member

stevengj commented Aug 30, 2016

See also the discussion in JuliaLang/julia#3721 .... if the font doesn't match what the terminal thinks the charwidth is, you are going to see problems regardless of what width we return.

-    elseif width=="Na"|| width=="H" # narrow or half
+    elseif width=="Na" || width=="H" # narrow or half
         CharWidths[c]=1
+    elseif width=="N" && haskey(CharWidths, c)
Member

Maybe && get(CharWidths, c, 0) > 0

Author

Yeah, I had that, but it didn't actually make a difference to the output. I can add it back, if you'd like.

@joshuarubin
Author

> Is Unifont in the clear minority among fonts here?

A good question, but I don't know how to answer it.

@jiahao
Collaborator

jiahao commented Aug 30, 2016

As @stevengj writes, Unicode explicitly avoids making recommendations for character widths. See JuliaLang/julia#3721 (comment) for a discussion, specifically the quote from UAX 11:

> The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.

Most people unfortunately skip over this very important wording in the beginning of UAX 11.

> Runes like U+0CA8 (ನ), U+0CB5 (ವ), U+0CB9 (ಹ), U+0C97 (ಗ), U+0CA6 (ದ), U+0CB0 (ರ) are being reported as width 2, when they should be width 1.

Could you explain why you say they should be width 1?
Note that the reference glyphs for the Kannada characters (not runes) in question (PDF) look very clearly like they are meant to be fullwidth. To look at just two:

[image: screenshot 2016-08-30 11 19 49]

Nevertheless, the reference glyphs are not meant to be authoritative in specifying character widths and individual fonts can do whatever they want. That's the main problem - it's not going to be possible to solve this problem in general for all fonts.

@stevengj
Member

(@jiahao, "rune" is Go terminology for a codepoint.)

@joshuarubin
Author

joshuarubin commented Aug 30, 2016

> Could you explain why you say they should be width 1?

Frankly, I agree that U+0CA6 looks like it should be double width. The problem I have is that my terminals (iTerm2.app nightly [with Unicode 9 char widths disabled] and Terminal.app using a variety of fonts from Monaco to Hasklig) render it as single width.

It also renders symbols (flags, emoji) as single width despite clearly overflowing into the next column.

It is easy to override the output of utf8proc_charwidth for UTF8PROC_CATEGORY_SO, but I have no way of knowing if it is a neutral width character that should have width 1 instead of 2.

If I had a way to identify the neutral width characters in utf8proc, I could just handle it on my own.

@joshuarubin
Author

Anything I can do to help move this closer to a resolution?

@stevengj
Member

stevengj commented Sep 7, 2016

I'm not sure if there is any satisfactory resolution here. No matter what we do, there will always be buggy terminals that rely on out-of-date operating-system wcwidth functions, and fonts that disagree on character widths.

@stevengj
Member

stevengj commented Sep 7, 2016

For example, IIRC the Unicode 9 standard reclassified emoji as fullwidth, but terminals haven't caught up.

@stevengj
Member

stevengj commented Sep 7, 2016

If you want to match the OS's (probably out-of-date) wcwidth on characters where it is defined, you could do:

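#define _XOPEN_SOURCE 700  /* exposes wcwidth() on glibc */
#include <wchar.h>         /* wcwidth */
#include <utf8proc.h>      /* utf8proc_charwidth */

/* Prefer the OS wcwidth when it reports a positive width; otherwise fall back to utf8proc. */
int mywidth(wchar_t c)
{
    int w = wcwidth(c);
    return w > 0 ? w : utf8proc_charwidth(c);
}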

@stevengj
Member

stevengj commented Sep 7, 2016

@jiahao, since neither the fonts nor UAX11 are authoritative here, maybe we should err on the side of the "informative" UAX11 suggestions rather than Unifont, on the theory that UAX11 seems more likely to match what terminals do?

@Keno
Member

Keno commented Sep 7, 2016

iTerm has unicode standard switching by proprietary escape code now.

@joshuarubin
Author

> iTerm has unicode standard switching by proprietary escape code now.

I can't find that documented. Do you have a link?

@Keno
Member

Keno commented Sep 7, 2016

@jiahao
Collaborator

jiahao commented Sep 26, 2016

> maybe we should err on the side of the "informative" UAX11 suggestions rather than Unifont, on the theory that UAX11 seems more likely to match what terminals do?

My concern here is that most terminals simply provide character widths that don't provide for a reasonable display of the characters in question. Even U+0ca6 above has a reference glyph that would suggest to me a fullwidth rather than halfwidth character. Since characters like these don't exist in the legacy East Asian character sets, I would think that the "default to narrow/halfwidth" behavior is a result more of neglect than an actual attempt to display these characters correctly in fixed width.

@joshuarubin
Author

Hi, this is still a problem. After considering this further, I think it may be more palatable, rather than merging this change, to add a new function that simply returns whether a character's width is ambiguous. It should return true for characters in the private use category and for the relevant East Asian characters, and false otherwise.
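
(Editorial sketch.) The predicate proposed above might look roughly like the following. This is a minimal sketch under stated assumptions, not utf8proc code: `width_is_ambiguous`, `cp_range`, and both tables are hypothetical names, and the East Asian Ambiguous table is only a small illustrative subset (Greek capitals and box drawing); a real table would be generated from EastAsianWidth.txt. The private-use ranges are the ones Unicode defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point range. */
typedef struct { uint32_t lo, hi; } cp_range;

/* Private-use ranges (BMP PUA and planes 15 and 16). */
static const cp_range PRIVATE_USE[] = {
    {0xE000, 0xF8FF}, {0xF0000, 0xFFFFD}, {0x100000, 0x10FFFD},
};

/* ASSUMPTION: illustrative subset of East Asian Ambiguous (A) ranges only.
 * A real implementation would generate this from EastAsianWidth.txt. */
static const cp_range EA_AMBIGUOUS_SAMPLE[] = {
    {0x0391, 0x03A1}, {0x03A3, 0x03A9}, {0x2500, 0x254B},
};

/* Linear scan; returns true if c falls inside any of the n ranges. */
static bool in_ranges(uint32_t c, const cp_range *r, size_t n)
{
    assert(r != NULL);
    for (size_t i = 0; i < n; i++)
        if (c >= r[i].lo && c <= r[i].hi)
            return true;
    return false;
}

/* Hypothetical helper (not part of utf8proc): true when the display width
 * depends on context, i.e. private use or East Asian Ambiguous. */
bool width_is_ambiguous(uint32_t c)
{
    return in_ranges(c, PRIVATE_USE,
                     sizeof PRIVATE_USE / sizeof PRIVATE_USE[0])
        || in_ranges(c, EA_AMBIGUOUS_SAMPLE,
                     sizeof EA_AMBIGUOUS_SAMPLE / sizeof EA_AMBIGUOUS_SAMPLE[0]);
}
```

Note that the Kannada characters from the original report are classified Neutral rather than Ambiguous, so a predicate like this would return false for U+0CA8.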

@stevengj
Member

@joshuarubin, that seems reasonable.

@joshuarubin
Author

Well, after taking a look at what this would require, I certainly think it would be useful to have a way to know whether characters have ambiguous width.

However, that would still not help with the situation that this issue describes.

I would like to reiterate my support for overriding Unifont in the case where the Unicode standard says a character has neutral width but Unifont renders it wider than 1 column.

The primary reason for this is that other systems implementing Unicode character widths (e.g. terminals) will adhere to the spec rather than defer to any particular font choice.

Supporting documentation:

http://unicode.org/reports/tr11/#ED7

> Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters [for East Asian character sets], but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na)

http://unicode.org/reports/tr11/#ED5

> East Asian Narrow (Na): All other characters that are always narrow

iTerm2 character width tables:

https://github.com/gnachman/iTerm2/blob/master/sources/NSCharacterSet%2BiTerm.m

@Keno
Member

Keno commented Nov 18, 2016

The fundamental problem here is that everyone along the stack needs to agree on the character width tables. There is no agreement on what those are, which is why iTerm added escape codes to switch the tables. It would have been nice not to be in this mess in the first place, but since we are, I feel like iTerm's approach is the sanest choice. Whatever you do, you're bound to break something.

@joshuarubin
Author

I was under the impression utf8proc tried to implement unicode 9 only? I'm not asking for support for other versions.

@stevengj
Member

If your objective is "look good on terminals", I think @Keno's point is it's still going to break because most terminal software doesn't use up-to-date widths, unless you have something like iTerm.

@joshuarubin
Author

The objective is to support Unicode 9 terminals, including iTerm2 and rxvt. As more terminals support these widths, the software running in them will need to use them as well. There's a chicken-and-egg problem here. I'm trying to find ways to get support working across the board in projects like terminals, tmux, vim/neovim, etc. I've now gone and created a new project to help this endeavor, wcwidth9. While I really like utf8proc for the simplicity it offers in many Unicode-related tasks, I simply can't suggest it be used for calculating character widths.

@Keno
Member

Keno commented Nov 21, 2016

The problem is that though e.g. iTerm2 treats characters like as width 1, they are rendered as width 2. Now that iTerm2 supports character table switching, I would actually like to add a mode to it that has its width tables match its rendering. Unicode 9 is better in that regard, because it fixed the official definition for emoji widths, but it's still not perfect with respect to the rendering. The wcwidth function implemented in this library represents our best understanding how wide things will be when rendered on the screen, which is its objective. It does not implement "How wide will my terminal think this character is", which is important, but a different question, and in general impossible to answer, because you'd have to tell it what the terminal's width tables are. What I would be sympathetic to is adding an option that allows you to select which width table to use. Then applications that care enough can detect their terminals and pick an appropriate width table (hell we could even provide such a function in utf8proc). The problem with choosing a standard one is that somebody will be unhappy, no matter what the choice.

I do appreciate you working on this though. It's an important and non-trivial problem.

Also cc @gnachman who may be interested in this discussion.
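
(Editorial sketch.) The "select which width table to use" option discussed above could be shaped roughly like this. It is a minimal sketch under stated assumptions, not an existing utf8proc API: `width_table`, `charwidth_with_table`, and both table functions are hypothetical, and the tables hold toy data only. The "rendered" table reports width 2 for just the six Kannada code points named in this thread, and the "narrow neutral" table reports 1 for everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector for which width table the caller wants. */
typedef enum {
    WIDTH_TABLE_RENDERED = 0,    /* widths as a reference font draws them */
    WIDTH_TABLE_NARROW_NEUTRAL,  /* neutral characters forced to width 1 */
    WIDTH_TABLE_COUNT
} width_table;

typedef int (*width_fn)(uint32_t c);

/* The six code points reported in this thread. */
static int is_thread_example(uint32_t c)
{
    static const uint32_t examples[] = {
        0x0CA8, 0x0CB5, 0x0CB9, 0x0C97, 0x0CA6, 0x0CB0
    };
    for (size_t i = 0; i < sizeof examples / sizeof examples[0]; i++)
        if (examples[i] == c)
            return 1;
    return 0;
}

/* TOY DATA: stands in for a font-derived table. */
static int width_rendered(uint32_t c)
{
    return is_thread_example(c) ? 2 : 1;
}

/* TOY DATA: stands in for a table that treats neutral as narrow. */
static int width_narrow_neutral(uint32_t c)
{
    (void)c;
    return 1;
}

/* Dispatch to the table the application selected. */
int charwidth_with_table(uint32_t c, width_table t)
{
    static const width_fn tables[WIDTH_TABLE_COUNT] = {
        width_rendered, width_narrow_neutral
    };
    assert((int)t >= 0 && (int)t < WIDTH_TABLE_COUNT);
    return tables[t](c);
}
```

An application that detects its terminal could then pick the matching table once at startup and pass it on every width query.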

@stevengj
Member

Isn't the operating system's (probably out-of-date) wcwidth function (when it returns a nonzero value) the most likely answer to "How wide will my terminal think this character is?"

@Keno
Member

Keno commented Nov 21, 2016

Depends on whether the terminal emulator asks the os's (or rather libc's) wcwidth or has its own character tables. iTerm2 has its own certainly. I suspect Terminal.app is using the libc one. Not sure about rxvt (@joshuarubin?). Many other terminals don't handle double wide characters at all.

@joshuarubin
Author

> Now that iTerm2 supports character table switching, I would actually like to add a mode to it that has its width tables match its rendering.

That's a very interesting idea, but it will also add to the confusion. I think some confusion while everything transitions to unicode9 is unavoidable at this point, but that the spec should be the guideline, not the font. If the terminal thinks it's width 1 and the software thinks it's width 1 but the font renders it as width 2, then that character will overflow into the next column on the display, but will not cause cascading rendering problems.

> The wcwidth function implemented in this library represents our best understanding how wide things will be when rendered on the screen, which is its objective.

I agree that is a very useful datapoint, it's just not what I need now when implementing interfaces.

> What I would be sympathetic to is adding an option that allows you to select which width table to use.

Again, I'll take that over nothing, but fear it might add long term confusion. As it is, I already need to know if a width is ambiguous (according to east asian context). I also need to know if a character is in the private use area which is ambiguous, but should not vary by east asian context. What's one more option...

> Isn't the operating system's (probably out-of-date) wcwidth function (when it returns a nonzero value) the most likely answer to "How wide will my terminal think this character is?"

wcwidth on (now) MacOS is horribly broken. What's more, it has no way of knowing if the context is unicode9 or earlier. Even further, it is incapable (except perhaps via locale, and maybe that is the correct way to handle it) of determining the east asian context (or the width that private use area chars should be).
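
(Editorial sketch.) The distinction drawn above between a glyph merely overflowing its cell and a cascading layout error can be illustrated by comparing the column at which each character starts under the application's widths and under the terminal's. This is a minimal sketch with hypothetical names and toy width functions (`all_narrow` and `kannada_wide` are assumptions for illustration, not real tables).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*width_fn)(uint32_t c);

/* Column at which s[index] starts when widths come from `width`. */
int start_column(const uint32_t *s, size_t index, width_fn width)
{
    assert(s != NULL && width != NULL);
    int col = 0;
    for (size_t i = 0; i < index; i++)
        col += width(s[i]);
    return col;
}

/* Number of characters whose start column differs between what the
 * application assumes and what the terminal actually does. */
size_t misaligned_count(const uint32_t *s, size_t n,
                        width_fn app, width_fn term)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (start_column(s, i, app) != start_column(s, i, term))
            bad++;
    return bad;
}

/* TOY width function: everything is one column. */
int all_narrow(uint32_t c)
{
    (void)c;
    return 1;
}

/* TOY width function: the Kannada block (U+0C80..U+0CFF) is two columns. */
int kannada_wide(uint32_t c)
{
    return (c >= 0x0C80 && c <= 0x0CFF) ? 2 : 1;
}
```

When both sides use the same function, no character is misaligned even if the font draws a glyph wider than its cell. When they disagree about one character, every character after it on the line starts in the wrong column.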

@gnachman

The character width situation is quite a mess, isn't it? Fonts render glyphs at different sizes so forget about trying to lay out based on the visible size of the glyph. The East Asian width is your best bet even though it's imperfect. I'm happy to help improve the situation for iTerm2 users but there's not much left I can do. I could help you determine the availability of the proprietary table switching escape code at runtime.

@stevengj
Member

Of course wcwidth is broken. But that's probably the function that most terminal software will use to determine character widths, no?

@joshuarubin
Author

> Of course wcwidth is broken. But that's probably the function that most terminal software will use to determine character widths, no?

The codebases I've looked at, including iTerm2, zsh, neovim and tmux have all replaced (either optionally or automatically through build-time platform identification or build-time tests) wcwidth in some form or another.

@jakwings

jakwings commented Apr 1, 2018

I think the box drawing characters should have width 1 by default, which a lot of CLI programs assume. Many CJK fonts set their widths to 2, but I think this can be resolved by introducing new terminal settings.
