wc might be an empty string #24

Closed
choldrim opened this issue Feb 8, 2018 · 4 comments

choldrim commented Feb 8, 2018

wcwidth/wcwidth/wcwidth.py

Lines 104 to 182 in c71459e

def wcwidth(wc):
    r"""
    Given one unicode character, return its printable length on a terminal.

    The wcwidth() function returns 0 if the wc argument has no printable effect
    on a terminal (such as NUL '\0'), -1 if wc is not printable, or has an
    indeterminate effect on the terminal, such as a control character.
    Otherwise, the number of column positions the character occupies on a
    graphic terminal (1 or 2) is returned.

    The following have a column width of -1:
        - C0 control characters (U+001 through U+01F).
        - C1 control characters and DEL (U+07F through U+0A0).

    The following have a column width of 0:
        - Non-spacing and enclosing combining characters (general
          category code Mn or Me in the Unicode database).
        - NULL (U+0000, 0).
        - COMBINING GRAPHEME JOINER (U+034F).
        - ZERO WIDTH SPACE (U+200B) through
          RIGHT-TO-LEFT MARK (U+200F).
        - LINE SEPERATOR (U+2028) and
          PARAGRAPH SEPERATOR (U+2029).
        - LEFT-TO-RIGHT EMBEDDING (U+202A) through
          RIGHT-TO-LEFT OVERRIDE (U+202E).
        - WORD JOINER (U+2060) through
          INVISIBLE SEPARATOR (U+2063).

    The following have a column width of 1:
        - SOFT HYPHEN (U+00AD) has a column width of 1.
        - All remaining characters (including all printable
          ISO 8859-1 and WGL4 characters, Unicode control characters,
          etc.) have a column width of 1.

    The following have a column width of 2:
        - Spacing characters in the East Asian Wide (W) or East Asian
          Full-width (F) category as defined in Unicode Technical
          Report #11 have a column width of 2.
    """
    # pylint: disable=C0103
    #         Invalid argument name "wc"
    ucs = ord(wc)

    # NOTE: created by hand, there isn't anything identifiable other than
    # general Cf category code to identify these, and some characters in Cf
    # category code are of non-zero width.

    # pylint: disable=too-many-boolean-expressions
    #          Too many boolean expressions in if statement (7/5)
    if (ucs == 0 or
            ucs == 0x034F or
            0x200B <= ucs <= 0x200F or
            ucs == 0x2028 or
            ucs == 0x2029 or
            0x202A <= ucs <= 0x202E or
            0x2060 <= ucs <= 0x2063):
        return 0

    # C0/C1 control characters
    if ucs < 32 or 0x07F <= ucs < 0x0A0:
        return -1

    # combining characters with zero width
    if _bisearch(ucs, ZERO_WIDTH):
        return 0

    return 1 + _bisearch(ucs, WIDE_EASTASIAN)

If wc is an empty string, a TypeError: ord() expected a character... exception is raised.

I think it would be better to add a statement such as if len(wc) == 0: return 0 before ord(wc).

@serhiy-storchaka

From docstring: "Given one unicode character, ...".

An empty string is not one unicode character. You passed a value which does not satisfy the function contract and got an error. You will get a similar error if you pass a unicode string containing more than one character.
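The behavior described here comes from the built-in ord() itself, which accepts only a string of length 1. A minimal sketch, using a hypothetical helper named ord_error that is not part of wcwidth, shows that both an empty string and a two-character string are rejected with a TypeError:

```python
def ord_error(text):
    """Return the TypeError message raised by ord(text), or None on success.

    Hypothetical helper for illustration only; not part of wcwidth.
    """
    try:
        ord(text)
    except TypeError as exc:
        return str(exc)
    return None


# An empty string and a multi-character string both fail;
# a single character succeeds and yields None.
for sample in ("", "ab", "a"):
    print(repr(sample), "->", ord_error(sample))
```

The exact wording of the message is chosen by the interpreter, so callers should not depend on it.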

Owner

jquast commented Dec 20, 2020

I agree, and thanks for the report. I will work on this as soon as I am able.

@wjandrea

@serhiy-storchaka, The problem is that there should be a specific error about it. For example, it could be as simple as subbing out ord for wcwidth in that error:

if len(wc) != 1:
    msg = 'wcwidth() expected a character, but string of length {} found'.format(len(wc))
    raise TypeError(msg)
ucs = ord(wc)

Example run:

>>> wcwidth('')
Traceback (most recent call last):
  ...
TypeError: wcwidth() expected a character, but string of length 0 found
>>> wcwidth('ab')
Traceback (most recent call last):
  ...
TypeError: wcwidth() expected a character, but string of length 2 found

jquast added a commit that referenced this issue Oct 30, 2023
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow !

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.  
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
Owner

jquast commented Oct 30, 2023

This is fixed in today's release. On empty string, wcwidth returns 0. Thanks!
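The resolution (an empty string yields 0) can be illustrated with a simplified stand-in. The sketch below is not the library's implementation: the name sketch_wcwidth is hypothetical, it uses the standard unicodedata module instead of wcwidth's generated tables, and the guard that maps an empty string to code point 0 is an assumption about how such a fix could look.

```python
import unicodedata


def sketch_wcwidth(wc):
    """Simplified stand-in for wcwidth(); NOT the library's code.

    Assumption: an empty string is treated like NUL, so it has width 0.
    """
    # Guard against the empty string before calling ord().
    ucs = ord(wc) if wc else 0
    if ucs == 0:
        return 0
    # C0/C1 control characters and DEL are not printable.
    if ucs < 32 or 0x07F <= ucs < 0x0A0:
        return -1
    # Non-spacing and enclosing combining marks occupy no column.
    if unicodedata.category(wc) in ("Mn", "Me"):
        return 0
    # East Asian Wide (W) and Full-width (F) characters occupy two columns.
    if unicodedata.east_asian_width(wc) in ("W", "F"):
        return 2
    return 1


for sample in ("", "a", "\x07", "\u4e2d", "\u0301"):
    print(repr(sample), sketch_wcwidth(sample))
```

A string longer than one character still raises TypeError from ord() in this sketch; only the empty-string case is handled.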

@jquast jquast closed this as completed Oct 30, 2023