bpo-43950: handle wide unicode characters in tracebacks #28150

isidentical · 2021-09-04T08:29:56Z

Do not merge yet, only for discussion.

This PR adds support for the existing traceback machinery to work with wide unicode characters when dumping to the terminal. It uses unicodedata.east_asian_width to classify individual unicode characters.
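A minimal sketch of the classification idea, not the PR's actual code. It assumes that characters whose east_asian_width is "W" (wide) or "F" (fullwidth) take two terminal columns and everything else takes one. The helper name display_width is hypothetical.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate how many terminal columns `text` occupies."""
    width = 0
    for char in text:
        # Assumption: "W" and "F" characters are rendered two columns wide,
        # every other category is treated as a single column.
        if unicodedata.east_asian_width(char) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width
```

With this sketch, display_width("该该该") is 6 while len("该该该") is 3, which is the gap the caret positions have to account for. Combining characters and font-specific rendering are ignored here.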

https://bugs.python.org/issue43950

isidentical · 2021-09-04T08:30:06Z

CC: @pablogsal @ammaraskar

pablogsal · 2021-09-04T15:09:25Z

Someone should try this on Windows with "Courier New" as Terry mentioned in the issue. It will be good to check with a bunch of fonts to see what we are up against.

Parser/pegen.c

isidentical · 2021-09-04T22:51:02Z

some notes

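One thing I noticed while doing the migration is that the C tokenizer doesn't match the Python one in how it treats unicode characters that take more than one byte in UTF-8. The Python tokenizer reports column offsets in characters, while the C tokenizer reports them in bytes: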

tokenizer:  python
  3xREGULAR
    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'")
  3xTINY H
    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 5), line="'ʰʰʰ'")
  3xEAST ASIAN
    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 5), line="'该该该'")
  3xEMOJIs
    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 5), line="'🐍🐍🐍'")
tokenizer:  c
  3xREGULAR
    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'\n")
  3xTINY H
    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 8), line="'ʰʰʰ'\n")
  3xEAST ASIAN
    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 11), line="'该该该'\n")
  3xEMOJIs
    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 14), line="'🐍🐍🐍'\n")

This is then mirrored directly in the AST offsets, which makes things a bit trickier at the traceback level. We could just patch this at the traceback level, though I am not really sure that 'hack' is the proper way to go. We might end up needing a solution similar to our _PyPegen_byte_offset_to_character_offset in the AST logic too, but I am not too certain how.
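To illustrate the mismatch, here is a rough Python analogue of a byte-offset-to-character-offset conversion. It is only a sketch that assumes the C tokenizer's columns are UTF-8 byte offsets (which the end columns above are consistent with); it is not the implementation of _PyPegen_byte_offset_to_character_offset, and the Python function name is made up for this example.

```python
def byte_offset_to_character_offset(line: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset within `line` into a character offset."""
    prefix = line.encode("utf-8")[:byte_offset]
    # errors="replace" avoids an exception if the offset lands in the
    # middle of a multi-byte character.
    return len(prefix.decode("utf-8", errors="replace"))
```

Applied to the four sample strings, the C end columns 5, 8, 11 and 14 all map back to the character offset 5 that the Python tokenizer reports.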

Seems to work now with my last commit

I simply applied the same functionality at the end of the specialization function and cast the newly retrieved AST offsets back to the regular ones. There is a lot of code that needs to be cleaned up in the final version, but the prototype at least works (for now).

github-actions · 2021-10-12T00:05:39Z

This PR is stale because it has been open for 30 days with no activity.

isidentical · 2022-11-02T00:35:29Z

Wow, this is an old PR but @pablogsal @ammaraskar are we interested in reviving it? I think I can rebase and get it ready for 3.12.

pablogsal · 2022-11-02T10:26:09Z

I am. I was thinking of reviving it just last week, so you managed to read my mind! 😁

isidentical · 2022-11-02T15:34:32Z

Hahahaha, perfection! I'll try to get it up to date over this week, and will ping you again for the review 💯

pablogsal · 2022-11-02T16:42:30Z

Hahahaha, perfection

ammaraskar · 2022-11-02T20:56:26Z

Sounds good, happy to re-review when you update :)

Sorry for the slow follow-ups around PEP657 stuff, I've been a little busy and inactive recently :(

isidentical · 2022-11-13T01:24:40Z

I've pushed the initial revision, where we handle everything using unicodedata.east_asian_width (double width for the W and F labels, single width for the rest) as per the discussion here. Before merging this we also have to decide:

Do we want to give users a way to opt in to or out of this? (Given the nature of the change, I'd say opt-out, since I don't think anybody is going to look into why tracebacks behave like this and then turn the feature on.)
Should we do some sort of terminal support detection? (Non-deterministic tracebacks that depend on non-PYTHON* environment variables don't sound really nice, but I have no strong opinion if we want to make this opt-in by default and add some sort of scanning.)
Is this applicable when showing the traceback on the web? A lot of web frameworks use traceback.* APIs to format tracebacks and then embed them inside an HTML page on error in debug mode. Should we offer a flag in the format_* APIs to allow them to switch without changing it at the global Python execution level?
Is this a feature or a fix? I am leaning towards a feature, since this essentially changes how tracebacks are printed in certain cases and people might depend on the behavior (there are a couple of codebases where tracebacks are also part of the tests themselves in stringized form).

I guess looking at what other compilers do by default might also help us gain some insight (I recall @pablogsal mentioning rustc; it might be worth checking what they do to decide when to show carets).
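The width rule described in this comment can be sketched roughly as follows. This is an illustration only, assuming character-based column offsets and the W/F double-width rule; caret_line and _width are hypothetical names, not the functions used in the traceback module.

```python
import unicodedata


def _width(text: str) -> int:
    # Assumption: "W" and "F" characters take two columns, the rest one.
    return sum(
        2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
        for char in text
    )


def caret_line(line: str, start: int, end: int) -> str:
    """Build the marker line for line[start:end], padded by display width."""
    padding = " " * _width(line[:start])
    carets = "^" * _width(line[start:end])
    return padding + carets
```

For example, caret_line("该该 + x", 5, 6) pads with seven spaces instead of five, because the two leading wide characters each take two columns.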

isidentical · 2022-11-13T01:25:38Z

CC: @cfbolz (would also love to hear your feedback on the unicode related parts)

cfbolz · 2022-11-14T08:25:26Z

I'll take a look at the code!

"Amusingly" the width doesn't line up in my browser's font:

(Looks fantastic in my editor and my terminal though)

Personal opinions on some of your questions:

imo it's a bug. for people with wide characters the position info was just confusing so far, fixing that is a bug fix.
maybe we want a way to opt out (I have no clear opinion on that) but the new behaviour should be the default

cfbolz · 2022-11-14T08:48:40Z

The code looks reasonable to me.

I've been thinking about it a bit more. It would certainly be more annoying to implement, but I am wondering whether it would be an option to use the unicode chars 0x3000 (IDEOGRAPHIC SPACE) and 0xFF3E (FULLWIDTH CIRCUMFLEX ACCENT) for the spaces/underlines under wide chars. Even if the width of two ascii spaces in a font is not the same as that of a fullwidth char, the font should at least be consistent with itself and give the fullwidth space the same width. Example:

Here are the chars:

说明说明📗a 
　　　　　a fullwidth space
＾＾＾＾＾a fullwidth circumflex
          a ascii whitespace
^^^^^^^^^^a ascii circumflex

Screenshot in my Firefox:

(in my terminal all the 'a' line up, so it doesn't matter there).
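A rough sketch of what this alternative could look like, assuming the same W/F test decides which characters get a fullwidth marker. The function and constant names are hypothetical, and this is not what the merged change does as far as this thread shows.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"     # fullwidth space
FULLWIDTH_CIRCUMFLEX = "\uff3e"  # fullwidth '^'


def _is_wide(char: str) -> bool:
    # Assumption: only "W" and "F" characters need a fullwidth marker.
    return unicodedata.east_asian_width(char) in ("W", "F")


def fullwidth_caret_line(line: str, start: int, end: int) -> str:
    """Emit one marker per character, fullwidth under wide characters."""
    markers = []
    for index, char in enumerate(line[:end]):
        wide = _is_wide(char)
        if index < start:
            markers.append(IDEOGRAPHIC_SPACE if wide else " ")
        else:
            markers.append(FULLWIDTH_CIRCUMFLEX if wide else "^")
    return "".join(markers)
```

Because each source character maps to exactly one marker, the alignment relies on the font being self-consistent rather than on a wide character being exactly two ascii columns.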

cfbolz · 2022-11-14T08:51:23Z

(Unfortunately it already breaks down in Chrome on my laptop, where the book emoji is even wider)

pablogsal · 2023-05-25T14:13:28Z

@isidentical let's push this forward. Could you rebase the PR?

cfbolz · 2023-10-26T06:55:23Z

yay, i'm excited for this to land :-)

bedevere-app · 2023-10-26T07:12:16Z

GH-111345 is a backport of this pull request to the 3.11 branch.

bedevere-app · 2023-10-26T07:15:29Z

GH-111346 is a backport of this pull request to the 3.12 branch.

bedevere-app · 2023-10-26T22:56:55Z

GH-111373 is a backport of this pull request to the 3.11 branch.

isidentical added the DO-NOT-MERGE label Sep 4, 2021

the-knights-who-say-ni added the CLA signed label Sep 4, 2021

bedevere-bot added the awaiting core review label Sep 4, 2021

isidentical added the skip news label Sep 4, 2021

ammaraskar reviewed Sep 4, 2021

isidentical force-pushed the bpo-43950-emojis branch from f4b35ac to 6ed3a9d Compare September 8, 2021 21:47

github-actions bot added the stale Stale PR or inactive for long period of time. label Oct 12, 2021

pablogsal mentioned this pull request Apr 13, 2022

Include column offsets for bytecode instructions #88116

Closed

ezio-melotti removed the CLA signed label Jul 13, 2022

github-actions bot removed the stale Stale PR or inactive for long period of time. label Aug 12, 2022

isidentical force-pushed the bpo-43950-emojis branch from 6ed3a9d to 8102a9d Compare November 13, 2022 01:14

isidentical marked this pull request as ready for review November 13, 2022 01:24

isidentical requested review from lysnikolaou and iritkatriel as code owners November 13, 2022 01:24

pythongh-88116: Handle wide unicode characters in tracebacks

5ec6af6

pablogsal force-pushed the bpo-43950-emojis branch from 8102a9d to 5ec6af6 Compare October 26, 2023 06:35

pablogsal approved these changes Oct 26, 2023

pablogsal removed the DO-NOT-MERGE label Oct 26, 2023

pablogsal enabled auto-merge (squash) October 26, 2023 06:41

pablogsal merged commit 78e6d72 into python:main Oct 26, 2023

bedevere-app bot removed the awaiting core review label Oct 26, 2023

pablogsal added a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023

[3.11] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d24d3f9

…nGH-28150) (cherry picked from commit 78e6d72) Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023

[3.11] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d7cebb3

…nGH-28150) (cherry picked from commit 78e6d72) Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d5ee672

…nGH-28150) (cherry picked from commit 78e6d72) Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (pytho…

a5018f1

…nGH-28150) (cherry picked from commit 78e6d72) Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal added a commit that referenced this pull request Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (GH-28150

c81ebf5

) (#111346)

pablogsal added a commit that referenced this pull request Oct 27, 2023

[3.11] bpo-43950: handle wide unicode characters in tracebacks (GH-28150

22cde39

) (#111373)

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024

bpo-43950: handle wide unicode characters in tracebacks (python#28150)

dd48f67

Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024

bpo-43950: handle wide unicode characters in tracebacks (python#28150)

fee2cec

HarryLHW mentioned this pull request Feb 18, 2025

Traceback colors are shifted when the line contains wide unicode characters #130273

Open
