Skip to content

bpo-43950: handle wide unicode characters in tracebacks #28150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

isidentical
Copy link
Member

@isidentical isidentical commented Sep 4, 2021

Do not merge yet, only for discussion.

This PR adds support for the existing traceback machinery to work with wide unicode characters when dumping to the terminal. It uses unicodedata.east_asian_width to classify individual unicode characters.
image

https://bugs.python.org/issue43950

@isidentical
Copy link
Member Author

CC: @pablogsal @ammaraskar

@pablogsal
Copy link
Member

Someone should try this on Windows with "Courier New" as Terry mentioned in the issue. It will be good to check with a bunch of fonts to see what we are up against.

@isidentical
Copy link
Member Author

isidentical commented Sep 4, 2021

some notes One thing I noticed while doing the migration was that, C tokenizer doesn't match with the Python one regarding how to treat unicode symbols with multiple code points;
tokenizer:  python
  3xREGULAR
    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'")
  3xTINY H
    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 5), line="'ʰʰʰ'")
  3xEAST ASIAN
    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 5), line="'该该该'")
  3xEMOJIs
    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 5), line="'🐍🐍🐍'")
tokenizer:  c
  3xREGULAR
    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'\n")
  3xTINY H
    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 8), line="'ʰʰʰ'\n")
  3xEAST ASIAN
    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 11), line="'该该该'\n")
  3xEMOJIs
    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 14), line="'🐍🐍🐍'\n")

Which then directly mirrored back to AST offets, which makes stuff a bit trickier at the traceback level. We could just patch this on the traceback level though I am not really sure that 'hack' is the proper way to go. We might need to end up a similiar solution to our _PyPegen_byte_offset_to_character_offset in the AST logic too, but I am not too certain about how.

Seems to work now with my last commit
image

I simply applied the same functionality at the end of the specialization function, and casted the newly retrieved AST offsets back to the regular ones. There are a lot of code that needs to be cleaned away in the final version, though for the prototype at least it works (for now).

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Oct 12, 2021
@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Aug 12, 2022
@isidentical
Copy link
Member Author

Wow, this is an old PR but @pablogsal @ammaraskar are we interested in reviving it? I think I can rebase and get it ready for 3.12.

@pablogsal
Copy link
Member

pablogsal commented Nov 2, 2022

I am. I was precisely thinking on reviving it last week so you managed to read my mind! 😁

@isidentical
Copy link
Member Author

Hahahaha, perfection! I'll try to get it up to date over this week, and will ping you again for the review 💯

@pablogsal
Copy link
Member

Hahahaha, perfection

giphy

@ammaraskar
Copy link
Member

Sounds good, happy to re-review when you update :)

Sorry for the slow follow-ups around PEP657 stuff, I've been a little busy and inactive recently :(

@isidentical
Copy link
Member Author

isidentical commented Nov 13, 2022

I've pushed the initial revision where we handle everything using the unicodedata.east_asian_width (double width on W and F labels, single on rest) as per the discussion here. Before merging this we also have to decide:

  • Do we want to give users a way to opt-in/opt-out from this (considering the nature, I'd say opt-out, since I don't think anybody is going to look why behaves like this and turn this feature on)?

  • Whether to do some sort of terminal support detection (non-deterministic tracebacks depending on non PYTHON* env variable configuration doesn't sound really nice but no strong opinions if we want to make this opt-In by default and add some sort of scanning)

  • Is this applicable when showing the traceback in web? A lot of web frameworks use traceback.* APIs to format tracebacks and then embed them inside an HTML page on error in debug mode. Should we offer a flag in format_* APIs to allow them switch without changing it in the global Python exectuon level.

  • Is this a feature or a fix? I am leaning towards a feature since this essentially changes how tracebacks are printed for certain cases and people might depend on the behavior (there are a couple of codebases where tracebacks are also part of the tests themselves in a stringized form)

I guess also looking at what other compilers are doing by default might also help us gain some insight (I recall @pablogsal mentioning rustc; maybe it might worth a shot to check out what they do to decide when to show carets).

@isidentical isidentical marked this pull request as ready for review November 13, 2022 01:24
@isidentical
Copy link
Member Author

CC: @cfbolz (would also love to hear your feedback on the unicode related parts)

@cfbolz
Copy link
Contributor

cfbolz commented Nov 14, 2022

I'll take a look at the code!

"Amusingly" the width doesn't line up in my browser's font:

image

(Looks fantastic in my editor and my terminal though)

Personal opinions on some of your questions:

  • imo it's a bug. for people with wide characters the position info was just confusing so far, fixing that is a bug fix.
  • maybe we want a way to opt out (I have no clear opinion on that) but the new behaviour should be the default

@cfbolz
Copy link
Contributor

cfbolz commented Nov 14, 2022

The code looks reasonable to me.

I've been thinking about it a bit more, and it would be certainly more annoying to implement, but I am wondering whether it wouldn't be an option to use unicode chars 0x3000 (IDEOGRAPHIC SPACE) and 0xFF3E (FULLWIDTH CIRCUMFLEX ACCENT) to do the spaces/underlines under wide chars. Because even if in a font the width of two ascii spaces is not the same as a fullwidth char, the font should at least be consistent with itself and have the fullwidth space be the same width. Example:

Here are the chars:

说明说明📗a 
     a fullwidth space
^^^^^a fullwidth circumflex
          a ascii whitespace
^^^^^^^^^^a ascii circumflex

Screenshot in my Firefox:

image

(in my terminal all the 'a' line up, so it doesn't matter there).

@cfbolz
Copy link
Contributor

cfbolz commented Nov 14, 2022

(Unfortunately it already breaks down in Chrome on my laptop, where the book emoji is even wider)

@pablogsal
Copy link
Member

@isidentical let's push this forward. Could you rebase the PR?

@pablogsal pablogsal enabled auto-merge (squash) October 26, 2023 06:41
@cfbolz
Copy link
Contributor

cfbolz commented Oct 26, 2023

yay, i'm excited for this to land :-)

@pablogsal pablogsal merged commit 78e6d72 into python:main Oct 26, 2023
pablogsal added a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023
…nGH-28150)

(cherry picked from commit 78e6d72)

Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Oct 26, 2023

GH-111345 is a backport of this pull request to the 3.11 branch.

pablogsal added a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023
…nGH-28150)

(cherry picked from commit 78e6d72)

Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Oct 26, 2023

GH-111346 is a backport of this pull request to the 3.12 branch.

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023
…nGH-28150)

(cherry picked from commit 78e6d72)

Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
pablogsal pushed a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023
…nGH-28150)

(cherry picked from commit 78e6d72)

Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
pablogsal pushed a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023
…nGH-28150)

(cherry picked from commit 78e6d72)

Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Oct 26, 2023

GH-111373 is a backport of this pull request to the 3.11 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants