Add functions to get the width in columns of a character #56777

Closed
vstinner opened this issue Jul 14, 2011 · 39 comments
Labels
3.8 (EOL) end of life topic-unicode type-feature A feature request or enhancement

Comments

@vstinner
Member

BPO 12568
Nosy @malemburg, @loewis, @terryjreedy, @vstinner, @benjaminp, @ezio-melotti, @merwok, @bitdancer, @serhiy-storchaka, @Vermeille, @ishigoya, @bianjp
Files
  • locale_width.patch
  • width.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-08.21:45:08.936>
    created_at = <Date 2011-07-14.22:43:56.356>
    labels = ['type-feature', '3.8', 'expert-unicode']
    title = 'Add functions to get the width in columns of a character'
    updated_at = <Date 2018-11-08.21:45:08.934>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-11-08.21:45:08.934>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-08.21:45:08.936>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2011-07-14.22:43:56.356>
    creator = 'vstinner'
    dependencies = []
    files = ['23401', '24773']
    hgrepos = []
    issue_num = 12568
    keywords = ['patch']
    message_count = 39.0
    messages = ['140376', '140488', '141936', '145497', '145498', '145523', '145535', '145748', '145778', '155223', '155236', '155307', '155313', '155323', '155324', '155337', '155342', '155343', '155344', '155345', '155346', '155361', '155370', '155373', '155379', '155382', '156337', '156348', '181149', '238425', '255421', '297129', '297489', '297492', '297564', '297569', '298322', '323731', '329488']
    nosy_count = 19.0
    nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'eric.araujo', 'Arfrever', 'r.david.murray', 'inigoserna', 'zeha', 'poq', 'Nicholas.Cole', 'tchrist', 'serhiy.storchaka', 'Socob', 'Guillaume Sanchez', 'ishigoya', 'bianjp']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue12568'
    versions = ['Python 3.8']

    @vstinner
    Member Author

    Some characters take more than one column in a terminal, especially CJK (Chinese, Japanese, Korean) characters. If you use such characters in a terminal without accounting for the width in columns of each character, text alignment can break. Issue bpo-2382 is an example of this problem.

    bpo-2382 and bpo-6755 have patches implementing such a function:

    • unicode_width.patch of bpo-2382 adds unicode.width() method
    • ucs2w.c of bpo-6755 creates a new ucs2w module with two functions: unichr2w() (width of a character) and ucs2w() (width of a string)

    Use test_ucs2w.py of bpo-6755 to test these new functions/methods.

    @loewis
    Mannequin

    loewis mannequin commented Jul 16, 2011

    In the bpo-2382 code, how is the Windows case supposed to work? Also, what about systems that don't have wcswidth? IOW, the patch appears to be incorrect.

    I like the bpo-6755 approach better, except that it shouldn't be using hard-coded tables, but instead integrate with Python's version of the UCD. In addition, it should use an accepted, published strategy for determining the width, preferably coming from the Unicode consortium.

    @tchrist
    Mannequin

    tchrist mannequin commented Aug 12, 2011

    I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking. There might be something you can steal from

    http://search.cpan.org/perldoc?Unicode::GCString
    http://search.cpan.org/perldoc?Unicode::LineBreak

    which implements UAX#14 on linebreaking and UAX#11 on East Asian widths.

    I use this in my own code to help format Unicode strings by columns or lines. The right way would be to build this sort of knowledge into string.format(), but that is much harder, so an intermediary library module seems good enough for now.

    @vstinner
    Member Author

    > There might be something you can steal from ...

    I don't think that Python should reinvent the wheel. We should just reuse wcswidth().

    Here is a simple patch exposing wcswidth() function as locale.width().

    Example:

    >>> import locale
    >>> text = '\u3042\u3044\u3046\u3048\u304a'
    >>> len(text)
    5
    >>> locale.width(text)
    10
    >>> locale.width(' ')
    1
    >>> locale.width('\U0010abcd')
    1
    >>> locale.width('\uDC80')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    locale.Error: the string is not printable
    >>> locale.width('\U0010FFFF')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    locale.Error: the string is not printable

    I don't think that we need locale.width() on Windows because its console already has bigger issues with Unicode: see issue bpo-1602. If you want to display non-ASCII characters correctly on Windows, just avoid the Windows console and use a graphical widget.

    @vstinner
    Member Author

    Oh, unicode_width.patch of issue bpo-2382 implements the width on Windows using:

    WideCharToMultiByte(CP_ACP, 0, buf, len, NULL, 0, NULL, NULL);

    It computes the length of the byte string encoded to the ANSI code page. I don't know if it can be seen as the "width" of a character string in the console...

    @loewis
    Mannequin

    loewis mannequin commented Oct 14, 2011

    I think the WideCharToMultiByte approach is just incorrect.

    I'm -1 on using wcswidth, though. We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/
    The outcomes of this function are these:

    • F: full-width, width 2, compatibility character for a narrow char
    • H: half-width, width 1, compatibility character for a narrow char
    • W: wide, width 2
    • Na: narrow, width 1
    • A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
    • N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1
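
As a rough illustration of the list above, the categories could be mapped to column counts as follows. This is only a sketch: the helper name `eaw_columns` and its `ambiguous_wide` flag are made up for this example, and the mapping ignores combining and control characters entirely.

```python
import unicodedata

# Column count for each East_Asian_Width value, following the list above.
# 'A' (ambiguous) is handled separately because it depends on context.
_EAW_COLUMNS = {"F": 2, "W": 2, "H": 1, "Na": 1, "N": 1}


def eaw_columns(ch, ambiguous_wide=False):
    """Return 1 or 2 for a single character, based only on UAX #11 data."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw == "A":
        return 2 if ambiguous_wide else 1
    return _EAW_COLUMNS[eaw]


print(eaw_columns("a"))                            # Latin letter
print(eaw_columns("\u3042"))                       # HIRAGANA LETTER A
print(eaw_columns("\u03b1", ambiguous_wide=True))  # GREEK SMALL LETTER ALPHA
```

Summing this over a string gives a first approximation of its width, but as later comments point out, it counts combining marks and format characters as one column each.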

    @tchrist
    Mannequin

    tchrist mannequin commented Oct 14, 2011

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > I think the WideCharToMultiByte approach is just incorrect.
    >
    > I'm -1 on using wcswidth, though.

    Like you, I too seriously question using wcswidth() for this at all:

    The wcswidth() function either shall return 0 (if pwcs points to a
    null wide-character code), or return the number of column positions
    to be occupied by the wide-character string pointed to by pwcs, or
    return -1 (if any of the first n wide-character codes in the wide-
    character string pointed to by pwcs is not a printable wide-
    character code).
    

    I would be willing to bet (a small amount of) money it does not correctly
    implement Unicode print widths, even though one would certainly *think* it
    does according to this:

     The wcswidth() function determines the number of column positions
     required for the first n characters of pwcs, or until a null wide
     character (L'\0') is encountered.
    

    There are a bunch of "interesting" cases I would want it tested against.

    > We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/
    >
    > The outcomes of this function are these:
    >
    > • F: full-width, width 2, compatibility character for a narrow char
    > • H: half-width, width 1, compatibility character for a narrow char
    > • W: wide, width 2
    > • Na: narrow, width 1
    > • A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
    > • N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1

    Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
    And EA=N cannot be considered 1, either.

    For example, some of the Marks are EA=A and some are EA=N, yet how many
    print columns they take varies. It is usually 0, but can be 1 at the start
    of the file/string or immediately after a linebreak sequence. Then there
    are things like the variation selectors which are never anything.

    Now consider the many \pC code points, like

    U+0009  CHARACTER TABULATION
    U+00AD  SOFT HYPHEN 
    U+200C  ZERO WIDTH NON-JOINER
    U+FEFF  ZERO WIDTH NO-BREAK SPACE
    U+2062  INVISIBLE TIMES
    

    A TAB is its own problem but SHY we know is only width=1 immediately
    before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
    width=0. So are the INVISIBLE * code points.
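
The mismatch described here can be inspected with the existing `unicodedata` module. The sketch below only prints the general category and East Asian Width of one combining mark and a few of the code points named above. It does not compute a width, and the exact property values printed depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# A combining mark plus several of the format characters discussed above.
samples = ["\u0301", "\u00ad", "\u200c", "\ufeff", "\u2062"]

for ch in samples:
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)     # e.g. 'Mn' or 'Cf'
    eaw = unicodedata.east_asian_width(ch)  # one of F, H, W, Na, A, N; there is no zero-width value
    print(f"U+{ord(ch):04X} {category:2} {eaw:2} {name}")
```

None of these characters is reported as wide, yet none of them normally occupies a column of its own, which is why the general category has to be consulted as well.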

    Context:

    Imagine you're trying to format a string so that it takes up exactly 20
    columns: you need to know how many spaces to pad it with based on the
    print width. That is what bpo-12568 needs
    to do, and you have to do much more than East Asian Width properties.
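
To make the padding scenario concrete, here is a minimal sketch. The names `approx_columns` and `pad_to_columns` are invented for this example, and the width rule (0 for nonspacing, enclosing, and format characters; 2 for wide and fullwidth; 1 otherwise) is a deliberate simplification that ignores tabs, soft hyphens, and terminal differences.

```python
import unicodedata


def approx_columns(text):
    """Estimate the number of terminal columns occupied by text."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # treated as zero-width in this sketch
        if unicodedata.east_asian_width(ch) in ("F", "W"):
            total += 2
        else:
            total += 1
    return total


def pad_to_columns(text, columns):
    """Right-pad text with spaces so that it fills the given column count."""
    missing = columns - approx_columns(text)
    return text + " " * max(missing, 0)


for word in ["abc", "\u3042\u3044\u3046", "cafe\u0301"]:
    print(pad_to_columns(word, 20) + "|")
```

With `str.ljust(20)` the three closing bars would not line up on a terminal that renders the hiragana as double-width, because `ljust` counts code points rather than columns.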

    I really do think that what bpo-12568 is asking for is to have the equivalent
    of the Perl Unicode::GCString's columns() method, and that you aren't going
    to be able to handle text alignment of Unicode with anything that is much
    less of that. After all, bpo-12568's title is "Add functions to get the width
    in columns of a character". I would very much like to compare what
    columns() thinks compared with what wcswidth() thinks. I bet wcswidth() is
    very simple-minded at best.

    I may of course be wrong.

    --tom

    @vstinner
    Member Author

    > I'm -1 on using wcswidth, though.

    When you write text into a console on Linux (e.g. displayed by gnome-terminal or konsole), I suppose that wcswidth() can be used to compute the width of a line. It would help to fix bpo-2382.

    Or do you think that wcswidth() gives the wrong result for this use case?

    @loewis
    Mannequin

    loewis mannequin commented Oct 18, 2011

    > I'm -1 on using wcswidth, though.

    > When you write text into a console on Linux (e.g. displayed by
    > gnome-terminal or konsole), I suppose that wcswidth() can be used to
    > compute the width of a line. It would help to fix bpo-2382.
    >
    > Or do you think that wcswidth() gives the wrong result for this use
    > case?

    No, I think that using it is not necessary. If you want to compute the
    width of a line, use unicodedata.east_asian_width. And yes, wcswidth
    may sometimes produce "incorrect" results (although it's probably
    correct most of the time).

    @NicholasCole
    Mannequin

    NicholasCole mannequin commented Mar 9, 2012

    Could we have an update on the status of this? I ask because if 3.3 is going to (finally) fix unicode for curses, it would be really nice if it were possible to calculate the width of what's being displayed! It looks as if there was never quite agreement on the proper API....

    @loewis
    Mannequin

    loewis mannequin commented Mar 9, 2012

    Nicholas: I consider this issue fixed. There already *is* an API to compute the width of a character. Closing this as "works for me".

    @loewis loewis mannequin closed this as completed Mar 9, 2012
    @NicholasCole
    Mannequin

    NicholasCole mannequin commented Mar 10, 2012

    Martin: sorry to be completely dense, but I can't get this to work properly with the python3.3a1 build. Could you post some example code?

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2012

    Please see the attached width.py for an example

    @poq
    Mannequin

    poq mannequin commented Mar 10, 2012

    Martin, I think you meant to write "if w == 'A':".
    Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

    http://unicode.org/reports/tr11/ says:
    "Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (i.e., normal-width) characters in non-East Asian usage."

    So in practice applications can treat ambiguous characters as narrow by default, with a user setting to use legacy (wide) width.

    As Tom pointed out there are also a bunch of zero width characters, and characters with special formatting like tab, soft hyphen, ...

    @tchrist
    Mannequin

    tchrist mannequin commented Mar 10, 2012

    I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
    which fully implements tr11. It includes Unicode::GCString, a class
    that has a columns() method to determine the print columns. This is very
    fancy in the case of Asian widths, but of course there are many other cases too.

    If you'd like, I can show you a program that uses these, a rewrite of the
    standard Unix fmt(1) filter that works properly on Unicode column widths.

    --tom

    @NicholasCole
    Mannequin

    NicholasCole mannequin commented Mar 10, 2012

    Martin and poq: I think the sample code shows up a real problem. "Ambiguous" characters according to Unicode may be rendered by curses in different ways.

    Don't we need a function that actually reports how curses is going to print a given string, rather than just reporting what the unicode standard says?

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2012

    > Martin, I think you meant to write "if w == 'A':".
    > Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

    That's precisely why I don't think this should be in the library, but
    in the application. Application developers who need that also need
    to concern themselves with the border cases, and decide on how
    they need to resolve them.

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2012

    > I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
    > which fully implements tr11.

    Thanks for the pointer!

    > If you'd like, I can show you a program that uses these, a rewrite of the
    > standard Unix fmt(1) filter that works properly on Unicode column widths.

    I believe there can't be any truly "proper" implementation, as you
    can't be certain how the terminal will handle these itself. In any
    case, anybody who is interested in contributing a patch should also
    be capable of understanding the source of Unicode::LineBreak.

    @tchrist
    Mannequin

    tchrist mannequin commented Mar 10, 2012

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > Martin, I think you meant to write "if w == 'A':".
    > Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

    That's precisely why I don't think this should be in the library, but
    in the application. Application developers who need that also need
    to concern themselves with the border cases, and decide on how
    they need to resolve them.

    The column-width of a string is not an application issue. It is
    well-defined by Unicode. Again, please see how we've done it in
    Perl, where tr11 is fully implemented. The columns() method from
    Unicode::GCString always gives the right answer per the Standard for
    any string, even what you are calling ambiguous ones.

    This is not an applications issue -- at all.

    --tom

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2012

    > Don't we need a function that actually reports how curses is going to
    > print a given string, rather than just reporting what the unicode
    > standard says?

    That may be useful, but

    a) this patch doesn't provide that, and
    b) it may not actually be possible to implement such a change in a portable
    way as there may be no function exposed by the curses implementation
    that provides this information.

    To put my closing this issue differently: I rejected the patch that
    Victor initially submitted. If anybody wants to contribute a different
    patch that uses a different strategy, please submit a new issue.

    @pitrou pitrou reopened this Mar 10, 2012
    @tchrist
    Mannequin

    tchrist mannequin commented Mar 10, 2012

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
    > which fully implements tr11.

    Thanks for the pointer!

    > If you'd like, I can show you a program that uses these, a rewrite of the
    > standard Unix fmt(1) filter that works properly on Unicode column widths.

    I believe there can't be any truly "proper" implementation, as you
    can't be certain how the terminal will handle these itself.

    Hm. I think we may not be talking about the same thing after all.

    If we're talking about the Curses library, or something similar,
    this is not the same. I do not think Curses has support for
    combining characters, right to left text, wide characters, etc.

    However, Unicode does, and defines the column width for those.

    I have an illustration of what this looks like in the picture
    in the very last recipe, #44, in

    http://training.perl.com/scripts/perlunicook.html
    

    That is what I have been talking about by print widths. It's running
    in a Mac terminal emulator, and unlike the HTML which grabs from too
    many fonts, the terminal program does the right thing with the widths.

    Are we talking about different things?

    --tom

    @poq
    Mannequin

    poq mannequin commented Mar 11, 2012

    It seems this is a bit of a minefield...

    GNOME Terminal/libvte has an environment variable (VTE_CJK_WIDTH) to override the handling of ambiguous width characters. It bases its default on the locale (with the comment 'This is basically what GNU libc does').

    urxvt just uses system wcwidth.

    Xterm uses some voodoo to decide between system wcwidth and mk_wcwidth(_cjk): http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

    I think the simplest solution is to just expose libc's wc(s)width. It is widely used and is most likely to match the behaviour of the terminal.

    FWIW I wrote a little script to test the widths of all Unicode characters, and came up with the following logic to match libvte behaviour:

    import unicodedata

    def wcwidth(c, legacy_cjk=False):
        # Tab, CR, LF, backspace, VT and FF move the cursor instead of occupying columns.
        if c in u'\t\r\n\10\13\14': raise ValueError('character %r has no intrinsic width' % c)
        if c in u'\0\5\7\16\17': return 0 # NUL, ENQ, BEL, SO, SI
        if u'\u1160' <= c <= u'\u11ff': return 0 # hangul jamo
        if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != u'\u00ad': return 0 # 00ad = soft hyphen
        eaw = unicodedata.east_asian_width(c)
        if eaw in ('F', 'W'): return 2
        # Ambiguous characters are wide only when legacy CJK behaviour is requested.
        if legacy_cjk and eaw == 'A': return 2
        return 1

    @loewis
    Mannequin

    loewis mannequin commented Mar 11, 2012

    Tom: I don't think Unicode::GCString implements UAX#11 correctly (but this is really out of scope of this issue). In particular, it contains an ad-hoc decision to introduce the EA_Z east-asian width that UAX#11 doesn't talk about.

    In most cases, it's probably reasonable to introduce this EA_Z feature. However, there are some significant deviations from UAX#11 here:

    • combining characters are given EA_Z in sombok/data/custom.pl, even though UAX#11 assigns A or N. UAX#11 points out that the advance width depends on whether the terminal performs character combination. It's not clear whether Unicode::GCString aims for "strict" UAX#11, or "advance width".
    • control characters are also given EA_Z, even though UAX#11 gives them EA_N. In this case, it's neither UAX#11 width nor advance width since control characters will have various effects on the terminal (in particular for the tab character)

    @loewis
    Mannequin

    loewis mannequin commented Mar 11, 2012

    poq: I still remain opposed to exposing wcswidth, since it is just as incorrect as any of the other solutions that people have circulated. I could agree to it if it was called "wcswidth", making it clear that it does whatever the C library does, with whatever semantics the C library wants to give to it (and an availability that depends on whether the C library supports it or not).

    That would probably cover the curses use cases, except that it is not only incorrect with respect to Unicode, but also incorrect with respect to what the terminal may be doing. I guess users would use it anyway.

    For Python's internal use, I could accept using the sombok algorithm. I wouldn't expose it, since it again would trick people into believing that it was correct in some sense. Perhaps calling it sombok_width might allow for exposing it.
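
For readers who want to experiment, the C library's function can already be reached from Python with `ctypes`. This is only a sketch under several assumptions: it relies on a POSIX system whose C library exports `wcswidth()`, the helper name `libc_wcswidth` is made up, and the result depends on the current locale. It is not the patch attached to this issue.

```python
import ctypes
import locale


def libc_wcswidth(text):
    """Call the C library's wcswidth(); return None if it is unavailable.

    Like the C function, this returns -1 when the string contains a
    character that the current locale considers non-printable.
    """
    try:
        libc = ctypes.CDLL(None)  # POSIX: look up symbols in the running process
        func = libc.wcswidth
    except (OSError, TypeError, AttributeError):
        return None               # e.g. on Windows
    func.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
    func.restype = ctypes.c_int
    return func(text, len(text))


# wcswidth() consults LC_CTYPE, so adopt the user's locale first.
try:
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    pass  # keep the default "C" locale

print(libc_wcswidth("abc"))
print(libc_wcswidth("\u3042\u3044\u3046"))
```

The second result varies with the platform and locale, which is exactly the "whatever the C library does" semantics discussed above.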

    @poq
    Mannequin

    poq mannequin commented Mar 11, 2012

    Martin,

    I agree that wcswidth is incorrect with respect to Unicode. However I don't think that's relevant at all. Python should only try to match the behaviour of the terminal.

    Since terminals do slightly different things, trying to match them exactly - in all cases, on all systems - is virtually impossible. But AFAICT wcwidth should match the terminal behaviour on nearly all modern systems, so it makes sense to expose it.

    @NicholasCole
    Mannequin

    NicholasCole mannequin commented Mar 11, 2012

    Poq: I agree. Guessing from the Unicode standard is going to lead to users having to write some complicated code that people are going to have to reinvent over and over, and is not going to be accurate with respect to curses. I'd favour exposing wcwidth.

    Martin: I agree that there are going to be cases where it is not correct because the terminal does something strange, but what we need is something that gets as close as possible to what the terminal is likely to be doing (the Unicode standard itself is not really the issue for curses stuff). So whether it is called wcwidth or wcswidth I don't really mind, but I think it would be useful.

    The other alternative is to include one of the other ideas that have been mentioned in this thread as part of the library, I suppose, so that people don't have to keep reinventing the wheel for themselves.

    The one thing I really don't favour is shipping something that supports wide characters, but gives the users no way of guessing whether or not that is what they are printing, because that is surely going to break a lot of applications.

    @vstinner
    Member Author

    > Martin: I agree that there are going to be cases where it is not
    > correct because the terminal does something strange, but what we
    > need is something that gets as close as possible to what the
    > terminal is likely to be doing

    Can't we expose wcswidth() as locale.strwidth() with a recipe explaining how to use unicodedata to get a "correct" result? At least until everyone implements Unicode correctly and Unicode stops evolving? :-)

    --

    For unicodedata, a function to get the width of a string would be more convenient than unicodedata.east_asian_width():

    >>> import unicodedata
    >>> unicodedata.east_asian_width('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: need a single Unicode character as parameter
    >>> 'abc'.ljust(unicodedata.east_asian_width(' '))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object cannot be interpreted as an integer

    The function posted in msg155361 suggests that east_asian_width() alone is not enough to get the width in columns of a single character.

    @serhiy-storchaka
    Member

    Has anyone tested wcswidth on FreeBSD, old Solaris? With non-utf8 locales?

    @terryjreedy terryjreedy added the type-feature A feature request or enhancement label Feb 2, 2013
    @terryjreedy
    Member

    In this part of width.py,
        w = unicodedata.east_asian_width(c)
        if c == 'A':
            # ambiguous
            raise ValueError("ambiguous character %x" % (ord(c)))

    I presume that 'c' should be 'w'.

    @vstinner
    Member Author

    Since no consensus was found on the definition of the function, and this issue has had no activity for 2 years, I am closing the issue as out of date.

    @serhiy-storchaka
    Member

    I think this function would be very useful in many parts of the interpreter core and standard library, from displaying tracebacks to formatting help.

    Otherwise we are doomed to implement imperfect variants in multiple places.

    @vstinner
    Member Author

    Since we failed to agree on this feature, I close the issue.

    @bitdancer
    Member

    Interestingly, this just came up again in bpo-30717.

    @serhiy-storchaka
    Member

    At least two other issues depend on this: bpo-17048 and bpo-24665.

    If Victor has lost interest in this issue, I'll take it. I'm going to push at least an imperfect solution, which may be improved over time.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Jul 1, 2017
    @vstinner
    Member Author

    vstinner commented Jul 3, 2017

    > At least two other issues depend on this: bpo-17048 and bpo-24665.

    I removed the dependency from bpo-24665 (CJK support for textwrap) to this issue, since its current PR uses unicodedata.east_asian_width(), not the C function wcswidth().

    @vstinner
    Member Author

    vstinner commented Jul 3, 2017

    You need users who use CJK and understand locale issues, especially the width of characters. Maybe ask Xiang Zhang and Naoki INADA?

    @Vermeille
    Mannequin

    Vermeille mannequin commented Jul 13, 2017

    Hello,

    I come from bugs.python.org/issue30717 . I have a pending PR that needs review ( #2673 ) adding a function that breaks unicode strings into grapheme clusters (aka what one would intuitively call "a character"). It's based on the grapheme cluster breaking algorithm from TR29.

    Let me know if this is of any relevance.

    Quick demo:
    >>> a=unicodedata.break_graphemes("lol")
    >>> list(a)
    ['l', 'o', 'l']
    >>> list(unicodedata.break_graphemes("lo\u0309l"))
    ['l', 'ỏ', 'l']
    >>> list(unicodedata.break_graphemes("lo\u0309\u0301l"))
    ['l', 'ỏ́', 'l']
    >>> list(unicodedata.break_graphemes("lo\u0301l"))
    ['l', 'ó', 'l']
    >>> list(unicodedata.break_graphemes(""))
    []
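
For comparison, a much cruder grouping can be done with what `unicodedata` already offers. The sketch below is not the TR29 algorithm and not the code from the PR. The hypothetical `simple_clusters` helper only attaches characters with a non-zero canonical combining class to the preceding character, so it misses Hangul jamo sequences, ZWJ emoji sequences, and other cases that TR29 covers.

```python
import unicodedata


def simple_clusters(text):
    """Yield base characters together with the combining marks after them."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch):
            cluster += ch      # combining mark: extend the current cluster
        else:
            if cluster:
                yield cluster  # a new base character closes the previous cluster
            cluster = ch
    if cluster:
        yield cluster


print(list(simple_clusters("lo\u0309\u0301l")))
```

Measuring width per cluster rather than per code point is one way to avoid counting the combining marks as separate columns.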

    @terryjreedy
    Member

    I suggest reclosing this issue, for the same reason I suggested closure of bpo-24665 in msg321291: abstract unicode 'characters' (graphemes) do not, in general, have fixed physical widths of 0, 1, or 2 n-pixel columns (or spaces). I based that fairly long message on IDLE's multiscript font sample as displayed on Windows 10. In that context, for instance, the width of (fixed-pitch) East Asian characters is about 1.6, not 2.0, times the width of fixed-pitch Ascii characters. Variable-width Tamil characters average about the same. The exact ratio depends on the Latin font used.

    I did more experiments with Python started from Command Prompt with code page 437 or 65001 and characters 20 pixels high. The Windows console only allows 'fixed pitch' fonts. East Asian characters, if displayed, are expanded to double width.

    However, European characters are not reliably displayed in one column. The width depends on both the font selected when a character is entered and the current font. The 20 latin1 characters in '¢£¥§©«®¶½ĞÀÁÂÃÄÅÇÐØß' usually display in 20 columns. But if they are entered with the font set to MSGothic, the '§' and '¶' are each displayed in the middle of 2 columns, for 22 total. If the font is changed to MSGothic after entry, the '§' and '¶' are shifted 1/2 column right to overlap the following '©' or '½' without changing the total width. Greek and Cyrillic characters also sometimes take two columns.

    I did not test whether the font size (pixel height) affects horizontal column spacing.

    @terryjreedy terryjreedy added 3.8 (EOL) end of life and removed 3.7 (EOL) end of life labels Aug 18, 2018
    @vstinner
    Member Author

    vstinner commented Nov 8, 2018

    I close the issue as WONTFIX.
