gh-63161: Fix PEP 263 support #139481

serhiy-storchaka · 2025-10-01T15:30:39Z

Support non-UTF-8 shebang and comments if non-UTF-8 encoding is specified.
Detect decoding error in comments for UTF-8 encoding.

Issue: Non-UTF8 encoding line #63161
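
For readers unfamiliar with PEP 263, here is a minimal sketch of what an encoding declaration means for the kind of file this PR targets. It uses the pure-Python `tokenize.detect_encoding` from the standard library, not the C tokenizer changed here. The sample source and the helper name `decode_source` are assumptions made for illustration only.

```python
import io
import tokenize

# Hypothetical source file: ASCII shebang and coding cookie, followed by a
# comment containing the Latin-1 byte 0xE9 ('é'), which is not valid UTF-8 here.
SOURCE = (
    b"#!/usr/bin/env python3\n"
    b"# -*- coding: latin-1 -*-\n"
    b"# caf\xe9\n"
    b"x = 1\n"
)


def decode_source(source: bytes) -> str:
    """Decode source bytes using the PEP 263 cookie found in the first two lines."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return source.decode(encoding)


text = decode_source(SOURCE)
```

With the declared encoding the comment decodes cleanly, whereas `SOURCE.decode("utf-8")` raises `UnicodeDecodeError` because of the byte in the comment.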


ashm-dev · 2025-10-01T17:24:40Z

Parser/tokenizer/file_tokenizer.c

+        const char *line = tok->lineno <= 2 ? tok->buf : tok->cur;
+        int lineno = tok->lineno <= 2 ? 1 : tok->lineno;
+        if (!tok->encoding) {
+            /* The default encoding is UTF-8, so make sure we don't have any
+               non-UTF-8 sequences in it. */
+            if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
+                _PyTokenizer_error_ret(tok);
+                return 0;
+            }
+        }
+        else {
+            PyObject *tmp = PyUnicode_Decode(line, strlen(line),


Suggested change

-        const char *line = tok->lineno <= 2 ? tok->buf : tok->cur;
-        int lineno = tok->lineno <= 2 ? 1 : tok->lineno;
-        if (!tok->encoding) {
-            /* The default encoding is UTF-8, so make sure we don't have any
-               non-UTF-8 sequences in it. */
-            if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
-                _PyTokenizer_error_ret(tok);
-                return 0;
-            }
-        }
-        else {
-            PyObject *tmp = PyUnicode_Decode(line, strlen(line),
+        const int is_pseudo_line = (tok->lineno <= 2);
+        const char *line = is_pseudo_line ? tok->buf : tok->cur;
+        int lineno = is_pseudo_line ? 1 : tok->lineno;
+        size_t slen = strlen(line);
+        if (slen > (size_t)PY_SSIZE_T_MAX) {
+            _PyTokenizer_error_ret(tok);
+            return 0;
+        }
+        Py_ssize_t linelen = (Py_ssize_t)slen;
+        if (!tok->encoding) {
+            /* The default encoding is UTF-8, so make sure we don't have any
+               non-UTF-8 sequences in it. */
+            if (!_PyTokenizer_ensure_utf8(line, tok, lineno)) {
+                _PyTokenizer_error_ret(tok);
+                return 0;
+            }
+        }
+        else {
+            PyObject *tmp = PyUnicode_Decode(line, linelen,
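
The two branches in this hunk can be modelled roughly in Python. This is only a sketch of the control flow, not the C implementation: the function name `check_line`, its parameters, and the error message are hypothetical. It assumes that "no declared encoding" means the line must be valid UTF-8, and that a declared encoding is used to decode the line instead.

```python
from typing import Optional


def check_line(line: bytes, lineno: int,
               declared_encoding: Optional[str] = None) -> str:
    """Decode one source line, following the two branches of the hunk."""
    if declared_encoding is None:
        # Default encoding is UTF-8: reject any non-UTF-8 sequence.
        encoding = "utf-8"
    else:
        # An encoding was declared: decode the raw line with it.
        encoding = declared_encoding
    try:
        return line.decode(encoding)
    except UnicodeDecodeError as exc:
        # Hypothetical message; the real tokenizer reports errors differently.
        raise SyntaxError(
            f"line {lineno}: cannot decode byte 0x{line[exc.start]:02x} "
            f"at offset {exc.start} as {encoding}"
        ) from exc
```

For example, `check_line(b"# caf\xe9\n", 1, "latin-1")` succeeds, while the same call without a declared encoding raises `SyntaxError`.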

vstinner

LGTM. I am not sure about the tokenizer changes, but I trust unit tests :-)

serhiy-storchaka · 2025-10-03T14:35:27Z

Unfortunately, there was a regression that caused one of the existing tests to fail. Previously, a decoding error for the default (UTF-8) encoding was raised only when the tokenizer tried to decode an identifier or string literal, so the traceback showed the affected line with the identifier or string literal containing the undecodable bytes underlined. Now the error is raised at the start of parsing a string, or after reading a line from the file (only for the first few lines).

Fixing this regression was not easy. But now the traceback shows the line with the cursor pointing exactly at the undecodable byte, and this works in more cases than before.

However, this did not work before, and still does not work, when the encoding is explicitly specified. In that case you get a SyntaxError without a correct reference to the position of the decoding error. That is a separate, complex issue.
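
To illustrate the kind of position reporting described in this comment for the default encoding, here is a standalone Python sketch that locates the first undecodable byte in UTF-8 source and places a caret under it. It is not the tokenizer's code; the function names `first_undecodable_byte` and `caret_report` and the report layout are assumptions made for this example.

```python
from typing import Optional, Tuple


def first_undecodable_byte(source: bytes) -> Optional[Tuple[int, int]]:
    """Return (1-based line number, 0-based byte offset) of the first bad byte."""
    for lineno, line in enumerate(source.splitlines(keepends=True), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that failed to decode.
            return lineno, exc.start
    return None


def caret_report(source: bytes) -> str:
    """Render the offending line with a caret under the undecodable byte."""
    pos = first_undecodable_byte(source)
    if pos is None:
        return ""
    lineno, col = pos
    line = source.splitlines()[lineno - 1]
    shown = line.decode("utf-8", errors="replace")
    # Everything before the bad byte is valid UTF-8, so count its characters
    # to find the caret column in the displayed text.
    prefix_len = len(line[:col].decode("utf-8"))
    return f"line {lineno}\n{shown}\n{' ' * prefix_len}^"
```

Calling `caret_report(b"x = 1\ny = 2  # \xff oops\n")` yields a three-line report whose caret sits under the replacement character shown for the `0xFF` byte.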

miss-islington-app · 2025-10-10T12:51:23Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

* Support non-UTF-8 shebang and comments if non-UTF-8 encoding is specified.
* Detect decoding error in comments for UTF-8 encoding.
* Include the decoding error position for default encoding in SyntaxError.

(cherry picked from commit 5c942f1)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

miss-islington-app · 2025-10-10T12:51:31Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 5c942f11cdf5f9d7313200983fa0c58b3bc670a2 3.13

bedevere-app · 2025-10-10T12:51:37Z

GH-139898 is a backport of this pull request to the 3.14 branch.


pythongh-63161: Fix PEP 263 support

7e9910e


serhiy-storchaka requested a review from vstinner October 1, 2025 15:30

serhiy-storchaka requested review from lysnikolaou and pablogsal as code owners October 1, 2025 15:30

serhiy-storchaka added the needs backport to 3.13 (bugs and security fixes) and needs backport to 3.14 (bugs and security fixes) labels Oct 1, 2025

bedevere-app bot added the awaiting core review label Oct 1, 2025

bedevere-app bot mentioned this pull request Oct 1, 2025

Non-UTF8 encoding line #63161

Open

ashm-dev reviewed Oct 1, 2025


vstinner approved these changes Oct 2, 2025


bedevere-app bot added awaiting merge and removed awaiting core review labels Oct 2, 2025

Include the decoding error position for default encoding in SyntaxError.

3ab168a

serhiy-storchaka requested a review from iritkatriel as a code owner October 3, 2025 14:24

serhiy-storchaka added 4 commits October 3, 2025 17:49

Try to disable colorization.

9fd1bb2

Fix tests on Windows.

62993b3

Merge branch 'main' into source-encoding

652a503

Silence some compiler warnings.

f932ebd

serhiy-storchaka enabled auto-merge (squash) October 10, 2025 12:42

serhiy-storchaka merged commit 5c942f1 into python:main Oct 10, 2025
81 of 83 checks passed

bedevere-app bot removed the awaiting merge label Oct 10, 2025

miss-islington-app bot assigned serhiy-storchaka Oct 10, 2025

bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Oct 10, 2025

serhiy-storchaka removed the needs backport to 3.13 bugs and security fixes label Oct 10, 2025
