Add functions to get the width in columns of a character #56777
Comments
Some characters take more than one column in a terminal, especially CJK (Chinese, Japanese, Korean) characters. If you use such characters in a terminal without accounting for the column width of each character, text alignment can break. Issue bpo-2382 is an example of this problem. bpo-2382 and bpo-6755 have patches implementing such a function:
Use test_ucs2w.py of bpo-6755 to test these new functions/methods. |
In the bpo-2382 code, how is the Windows case supposed to work? Also, what about systems that don't have wcswidth? IOW, the patch appears to be incorrect. I like the bpo-6755 approach better, except that it shouldn't be using hard-coded tables, but instead integrate with Python's version of the UCD. In addition, it should use an accepted, published strategy for determining the width, preferably coming from the Unicode consortium. |
I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking. There might be something you can steal from http://search.cpan.org/perldoc?Unicode::GCString which implements UAX#14 on linebreaking and UAX#11 on East Asian widths. I use this in my own code to help format Unicode strings by columns or lines. The right way would be to build this sort of knowledge into string.format(), but that is much harder, so an intermediary library module seems good enough for now. |
I don't think that Python should reinvent the wheel. We should just reuse wcswidth(). Here is a simple patch exposing the wcswidth() function as locale.width(). Example:
>>> import locale
>>> text = '\u3042\u3044\u3046\u3048\u304a'
>>> len(text)
5
>>> locale.width(text)
10
>>> locale.width(' ')
1
>>> locale.width('\U0010abcd')
1
>>> locale.width('\uDC80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable
>>> locale.width('\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable
I don't think that we need locale.width() on Windows because its console already has bigger issues with Unicode: see issue bpo-1602. If you want to display non-ASCII characters correctly on Windows, just avoid the Windows console and use a graphical widget. |
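The locale.width() function above comes from a patch and is not part of the standard library. For experimenting with the same C function today, the following is a minimal sketch that calls the C library's wcswidth() through ctypes. It assumes a POSIX-like system where `ctypes.CDLL(None)` exposes the process's libc symbols; `libc_wcswidth` is a hypothetical helper name, and the result depends on the current LC_CTYPE locale.

```python
import ctypes
import locale


def libc_wcswidth(text):
    """Return wcswidth(text) from the C library, or None if unavailable.

    A negative result (wcswidth() found a character that is not printable
    in the current locale) is also reported as None.
    """
    try:
        # Assumption: POSIX-like platform; CDLL(None) gives access to the
        # symbols already loaded into the process, including libc.
        libc = ctypes.CDLL(None)
        func = libc.wcswidth
    except (OSError, TypeError, AttributeError):
        return None
    func.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
    func.restype = ctypes.c_int
    result = func(text, len(text))
    return result if result >= 0 else None


# wcswidth() consults LC_CTYPE, so adopt the user's locale if possible.
try:
    locale.setlocale(locale.LC_CTYPE, '')
except locale.Error:
    pass

print(libc_wcswidth('abc'))
print(libc_wcswidth('\u3042\u3044\u3046\u3048\u304a'))
```

In the plain C locale the second call will typically report the string as not printable, which is one reason the result cannot be relied on across systems.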
Oh, unicode_width.patch of issue bpo-2382 implements the width on Windows using: WideCharToMultiByte(CP_ACP, 0, buf, len, NULL, 0, NULL, NULL); It computes the length of byte string encoded to the ANSI code page. I don't know if it can be seen as the "width" of a character string in the console... |
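The byte-count idea described above can be tried without Windows by encoding with an explicit codec. This is a rough sketch, assuming cp932 (the Japanese ANSI code page) as a stand-in for CP_ACP; `ansi_byte_length` is a hypothetical name, and `errors='replace'` only approximates how WideCharToMultiByte substitutes unmappable characters.

```python
def ansi_byte_length(text, codec='cp932'):
    """Length in bytes of text encoded to a legacy multi-byte code page.

    Assumption: `codec` plays the role of the Windows ANSI code page.
    Characters the code page cannot represent become a one-byte '?'.
    """
    return len(text.encode(codec, errors='replace'))


hiragana = '\u3042\u3044\u3046\u3048\u304a'
print(len(hiragana), ansi_byte_length(hiragana))
# Half-width katakana is a single byte in cp932 and narrow on screen.
print(ansi_byte_length('\uff76'))
```

The byte length matches the column count here only because double-byte characters in this code page happen to be drawn wide. With a different code page (UTF-8, for instance) the two numbers diverge, which supports the doubt raised in the comment.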
I think the WideCharToMultiByte approach is just incorrect. I'm -1 on using wcswidth, though. We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/
|
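For reference, here is a minimal sketch of what a width computation based only on unicodedata.east_asian_width() looks like. `naive_width` is a hypothetical name, and the rule "Wide and Fullwidth count as two columns, everything else as one" is an assumption for illustration, not a published algorithm.

```python
import unicodedata


def naive_width(text):
    """Sum per-character widths using only the East_Asian_Width property."""
    total = 0
    for ch in text:
        # 'W' (Wide) and 'F' (Fullwidth) take two columns; every other
        # class, including 'A' (Ambiguous), is counted as one column here.
        total += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return total


hiragana = '\u3042\u3044\u3046\u3048\u304a'
print(len(hiragana), naive_width(hiragana))
# A combining accent is counted as a full column by this naive version.
print(naive_width('e\u0301'))
```

The last call shows the limitation: a base letter followed by a combining mark is reported as two columns although it is displayed in one.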
Like you, I too seriously question using wcswidth() for this at all:
I would be willing to bet (a small amount of) money it does not correctly
There are a bunch of "interesting" cases I would want it tested against.
Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this. For example, some of the Marks are EA=A and some are EA=N, yet how may Now consider the many \pC code points, like
A TAB is its own problem but SHY we know is only width=1 immediately Context: Imagine you're trying to format a string so that it takes up exactly 20 I really do think that what bpo-12568 is asking for is to have the equivalent I may of course be wrong. --tom |
When you write text into a console on Linux (e.g. displayed by gnome-terminal or konsole), I suppose that wcswidth() can be used to compute the width of a line. It would help to fix bpo-2382. Or do you think that wcswidth() gives the wrong result for this use case? |
No, I think that using it is not necessary. If you want to compute the |
Could we have an update on the status of this? I ask because if 3.3 is going to (finally) fix unicode for curses, it would be really nice if it were possible to calculate the width of what's being displayed! It looks as if there was never quite agreement on the proper API.... |
Nicholas: I consider this issue fixed. There already *is* an API to compute the width of a character. Closing this as "works for me". |
Martin: sorry to be completely dense, but I can't get this to work properly with the python3.3a1 build. Could you post some example code? |
Please see the attached width.py for an example |
Martin, I think you meant to write "if w == 'A':". http://unicode.org/reports/tr11/ says: So in practice applications can treat ambiguous characters as narrow by default, with a user setting to use legacy (wide) width. As Tom pointed out there are also a bunch of zero width characters, and characters with special formatting like tab, soft hyphen, ... |
I would encourage you to look at the Perl CPAN module Unicode::LineBreak, If you'd like, I can show you a program that uses these, a rewrite the --tom |
Martin and Poq: I think the sample code shows up a real problem. "Ambiguous" characters according to Unicode may be rendered by curses in different ways. Don't we need a function that actually reports how curses is going to print a given string, rather than just reporting what the Unicode standard says? |
That's precisely why I don't think this should be in the library, but |
Thanks for the pointer!
I believe there can't be any truly "proper" implementation, as you |
The column-width of a string is not an application issue. It is This is not an applications issue -- at all. --tom |
That may be useful, but a) this patch doesn't provide that, and To put my closing this issue differently: I rejected the patch that |
Hm. I think we may not be talking about the same thing after all. If we're talking about the Curses library, or something similar, However, Unicode does, and defines the column width for those. I have an illustration of what this looks like in the picture
That is what I have been talking about by print widths. It's running Are we talking about different things? --tom |
It seems this is a bit of a minefield... GNOME Terminal/libvte has an environment variable (VTE_CJK_WIDTH) to override the handling of ambiguous width characters. It bases its default on the locale (with the comment 'This is basically what GNU libc does'). urxvt just uses system wcwidth. Xterm uses some voodoo to decide between system wcwidth and mk_wcwidth(_cjk): http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c I think the simplest solution is to just expose libc's wc(s)width. It is widely used and is most likely to match the behaviour of the terminal. FWIW I wrote a little script to test the widths of all Unicode characters, and came up with the following logic to match libvte behaviour:
import unicodedata

def wcwidth(c, legacy_cjk=False):
    # tab, CR, LF, backspace, vertical tab, form feed
    if c in u'\t\r\n\10\13\14': raise ValueError('character %r has no intrinsic width' % c)
    # NUL, ENQ, BEL, SO, SI
    if c in u'\0\5\7\16\17': return 0
    if u'\u1160' <= c <= u'\u11ff': return 0 # hangul jamo
    if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != u'\u00ad': return 0 # 00ad = soft hyphen
    eaw = unicodedata.east_asian_width(c)
    if eaw in ('F', 'W'): return 2
    if legacy_cjk and eaw == 'A': return 2
    return 1
|
Tom: I don't think Unicode::GCString implements UAX#11 correctly (but this is really out of scope of this issue). In particular, it contains an ad-hoc decision to introduce the EA_Z east-asian width that UAX#11 doesn't talk about. In most cases, it's probably reasonable to introduce this EA_Z feature. However, there are some significant deviations from UAX#11 here:
|
poq: I still remain opposed to exposing wcswidth, since it is just as incorrect as any of the other solutions that people have circulated. I could agree to it if it was called "wcswidth", making it clear that it does whatever the C library does, with whatever semantics the C library wants to give to it (and an availability that depends on whether the C library supports it or not). That would probably cover the curses use cases, except that it is not only incorrect with respect to Unicode, but also incorrect with respect to what the terminal may be doing. I guess users would use it anyway. For Python's internal use, I could accept using the sombok algorithm. I wouldn't expose it, since it again would trick people into believing that it was correct in some sense. Perhaps calling it sombok_width might allow for exposing it. |
Martin, I agree that wcswidth is incorrect with respect to Unicode. However I don't think that's relevant at all. Python should only try to match the behaviour of the terminal. Since terminals do slightly different things, trying to match them exactly - in all cases, on all systems - is virtually impossible. But AFAICT wcwidth should match the terminal behaviour on nearly all modern systems, so it makes sense to expose it. |
Poq: I agree. Guessing from the Unicode standard is going to lead to users having to write some complicated code that people are going to have to reinvent over and over, and is not going to be accurate with respect to curses. I'd favour exposing wcwidth. Martin: I agree that there are going to be cases where it is not correct because the terminal does something strange, but what we need is something that gets as close as possible to what the terminal is likely to be doing (the Unicode standard itself is not really the issue for curses stuff). So whether it is called wcwidth or wcswidth I don't really mind, but I think it would be useful. The other alternative is to include one of the other ideas that have been mentioned in this thread as part of the library, I suppose, so that people don't have to keep reinventing the wheel for themselves. The one thing I really don't favour is shipping something that supports wide characters, but gives the users no way of guessing whether or not that is what they are printing, because that is surely going to break a lot of applications. |
Can't we expose wcswidth() as locale.strwidth() with a recipe explaining how to use unicodedata to get a "correct" result? At least until everyone implements Unicode correctly and Unicode stops evolving? :-) -- For unicodedata, a function to get the width of a string would be more convenient than unicodedata.east_asian_width():
>>> import unicodedata
>>> unicodedata.east_asian_width('abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
>>> 'abc'.ljust(unicodedata.east_asian_width(' '))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
The function posted in msg155361 suggests that east_asian_width() alone is not enough to get the width in columns of a single character. |
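As an illustration of the string-level helper asked for above, here is a hedged sketch of a column-aware ljust(). The names `char_width`, `str_width` and `ljust_columns` are hypothetical, the width rules are a simplified version of the logic posted earlier in this thread (control characters are not handled), and the fill character is assumed to be one column wide.

```python
import unicodedata


def char_width(ch, ambiguous_wide=False):
    """Rough column width of a single character (0, 1 or 2)."""
    # Combining marks and format controls take no column of their own;
    # the soft hyphen (U+00AD) is the exception kept at width 1.
    if unicodedata.category(ch) in ('Mn', 'Me', 'Cf') and ch != '\u00ad':
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ('F', 'W') or (ambiguous_wide and eaw == 'A'):
        return 2
    return 1


def str_width(text, ambiguous_wide=False):
    """Sum of the per-character widths of text."""
    return sum(char_width(ch, ambiguous_wide) for ch in text)


def ljust_columns(text, columns, fill=' '):
    """Like str.ljust(), but pad to a column count instead of a length."""
    return text + fill * max(0, columns - str_width(text))


for word in ('abc', '\u3042\u3044', 'e\u0301'):
    print(ljust_columns(word, 6) + '|')
```

Whether the bars actually line up still depends on the terminal and font, as discussed elsewhere in this thread.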
Has anyone tested wcswidth on FreeBSD, old Solaris? With non-utf8 locales? |
In this part of width.py, I presume that 'c' should be 'w'. |
Since no consensus was found on the definition of the function, and this issue has had no activity for 2 years, I am closing the issue as out of date. |
I think this function would be very useful in many parts of the interpreter core and standard library, from displaying tracebacks to formatting help text. Otherwise we are doomed to implement imperfect variants in multiple places. |
Since we failed to agree on this feature, I close the issue. |
Interestingly, this just came up again in bpo-30717. |
I removed the dependency from bpo-24665 (CJK support for textwrap) to this issue, since its current PR uses unicodedata.east_asian_width(), not the C function wcswidth(). |
You need users who use CJK and understand locale issues, especially the width of characters. Maybe ask Xiang Zhang and Naoki INADA? |
Hello, I come from bugs.python.org/issue30717 . I have a pending PR that needs review ( #2673 ) adding a function that breaks unicode strings into grapheme clusters (aka what one would intuitively call "a character"). It's based on the grapheme cluster breaking algorithm from TR29. Let me know if this is of any relevance. Quick demo:
>>> a=unicodedata.break_graphemes("lol")
>>> list(a)
['l', 'o', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309l"))
['l', 'ỏ', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309\u0301l"))
['l', 'ỏ́', 'l']
>>> list(unicodedata.break_graphemes("lo\u0301l"))
['l', 'ó', 'l']
>>> list(unicodedata.break_graphemes(""))
[] |
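unicodedata.break_graphemes() above comes from a pending PR and is not available in the standard library. For the simple cases in the demo, a rough approximation can be written with unicodedata.category(). This sketch is not the TR29 algorithm: it only attaches combining marks to the preceding character and ignores Hangul sequences, emoji joiners, CR LF and other rules. `rough_clusters` is a hypothetical name.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ('Mn', 'Me', 'Mc'):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


print(rough_clusters("lol"))
print(rough_clusters("lo\u0309\u0301l"))
print(rough_clusters(""))
```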
I suggest reclosing this issue, for the same reason I suggested closure of bpo-24665 in msg321291: abstract Unicode 'characters' (graphemes) do not, in general, have fixed physical widths of 0, 1, or 2 n-pixel columns (or spaces). I based that fairly long message on IDLE's multiscript font sample as displayed on Windows 10. In that context, for instance, the width of (fixed-pitch) East Asian characters is about 1.6, not 2.0, times the width of fixed-pitch ASCII characters. Variable-width Tamil characters average about the same. The exact ratio depends on the Latin font used. I did more experiments with Python started from Command Prompt with code page 437 or 65001 and characters 20 pixels high. The Windows console only allows 'fixed pitch' fonts. East Asian characters, if displayed, are expanded to double width. However, European characters are not reliably displayed in one column. The width depends on both the font selected when a character is entered and the current font. The 20 latin1 characters in '¢£¥§©«®¶½ĞÀÁÂÃÄÅÇÐØß' usually display in 20 columns. But if they are entered with the font set to MSGothic, the '§' and '¶' are each displayed in the middle of 2 columns, for 22 total. If the font is changed to MSGothic after entry, the '§' and '¶' are shifted 1/2 column right to overlap the following '©' or '½' without changing the total width. Greek and Cyrillic characters also sometimes take two columns. I did not test whether the font size (pixel height) affects horizontal column spacing. |
I close the issue as WONTFIX. |