Simplify generation of printable representation #26

Open: SultanOrazbayev wants to merge 2 commits into base: master

SultanOrazbayev

Use the .isprintable() method to improve the readability of the function (the intent of the check becomes clearer) and to remove the explicit unicodedata import.
@karpathy
Owner

are we sure these are equivalent

@SultanOrazbayev
Author

SultanOrazbayev commented Feb 21, 2024

They are not equivalent. The suggested approach also escapes the special cases U+2028 and U+2029, along with the other space separators listed here, none of which the current approach (checking whether the category starts with "C") catches.

The current approach captures every category that starts with "C"; according to table 12 here, these make up the "Other" group (last line of the table).
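To make the category-based check concrete, here is a minimal sketch. The helper name `escape_by_category` and the `\uXXXX` escape format are assumptions for illustration, not necessarily the code in this repository.

```python
import unicodedata


def escape_by_category(s: str) -> str:
    """Hypothetical helper: escape characters whose category starts with "C"."""
    out = []
    for ch in s:
        if unicodedata.category(ch).startswith("C"):
            # "Other" group (Cc, Cf, Cs, Co, Cn): replace with an escape
            out.append(f"\\u{ord(ch):04x}")
        else:
            # everything else, including the Z* separators, is kept as-is
            out.append(ch)
    return "".join(out)


print(escape_by_category("a\x00b"))                  # NUL is Cc, so it is escaped
print(escape_by_category("a\u2028b") == "a\u2028b")  # Zl is left untouched
```

With this check, a line separator (category Zl) passes through unchanged, which is the gap discussed in this comment.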

str.isprintable will:

Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when [repr()](https://docs.python.org/3/library/functions.html#repr) is invoked on a string. It has no bearing on the handling of strings written to [sys.stdout](https://docs.python.org/3/library/sys.html#sys.stdout) or [sys.stderr](https://docs.python.org/3/library/sys.html#sys.stderr).)
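One way to check where the two tests disagree is to scan every code point and record the categories that only one of them flags. This is a rough sketch; the variable names are made up for illustration, and the result depends on the Unicode database shipped with the interpreter.

```python
import sys
import unicodedata

only_isprintable = set()  # categories flagged by isprintable() alone
only_category_c = set()   # categories flagged by the "C" check alone

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    cat = unicodedata.category(ch)
    flagged_by_c = cat.startswith("C")
    flagged_by_isprintable = not ch.isprintable()
    if flagged_by_isprintable and not flagged_by_c:
        only_isprintable.add(cat)
    elif flagged_by_c and not flagged_by_isprintable:
        only_category_c.add(cat)

print(sorted(only_isprintable))
print(sorted(only_category_c))
```

Given the documented definition, the first set should contain only the separator categories (Zl, Zp, Zs) and the second should be empty, meaning isprintable() flags a strict superset of what the "C" check flags.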

So the isprintable method also captures the special cases of the line and paragraph separators, U+2028 and U+2029. They are tricky because standard Python printing renders them as whitespace, so from the user's perspective they are not distinct from an ordinary space between words:

from unicodedata import category

# https://www.compart.com/en/unicode/category/Zs
for char in ["\u2028", "\u2029", "\u1680"]:
    print(f"{ord(char):#04x}: {char}")
    print(f"{char.isprintable()=}")  # False: not printable, so the new check would escape it
    print(f'{category(char)}')  # does not start with "C", so the current check includes it as-is
    print()
0x2028: 

char.isprintable()=False
Zl

0x2029: 

char.isprintable()=False
Zp

0x1680:  
char.isprintable()=False
Zs
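For comparison, a sketch of what an isprintable-based replacement could look like. The function name `escape_nonprintable` and the `\uXXXX` format are assumptions for illustration, not the exact code in this pull request.

```python
def escape_nonprintable(s: str) -> str:
    """Hypothetical helper: escape every character that str.isprintable() rejects."""
    return "".join(
        ch if ch.isprintable() else f"\\u{ord(ch):04x}"
        for ch in s
    )


for ch in ["\u2028", "\u2029", "\u1680", " "]:
    # the three separators come out as visible escapes; the ASCII space is kept
    print(escape_nonprintable(f"a{ch}b"))
```

Here the separators become visible escape sequences in the output, while the ASCII space, which isprintable() treats as printable, is left alone.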
