gh-115154: Fix untokenize handling of unicode named literals #115171

Merged
merged 8 commits into python:main from gh-115154 on Feb 19, 2024

Conversation


@pablogsal pablogsal commented Feb 8, 2024

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
@terryjreedy

This failed twice on Windows x64 with the same random file (test_peepholer.py)

======================================================================
FAIL: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='D:\\a\\cpython\\cpython\\Lib\\test\\test_peepholer.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_tokenize.py", line 1927, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "D:\a\cpython\cpython\Lib\test\test_tokenize.py", line 1823, in check_roundtrip
    self.assertEqual(tokens2_from2, tokens2)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Lists differ: [(65,[68848 chars]'), (55, '{'), (3, "'a'"), (55, ':'), (60, ' '[54206 chars] '')] != [(65,[68848 chars]'), (60, '{'), (60, "'a': "), (55, '{'), (1, '[54177 chars] '')]

First differing element 5042:
(55, '{')
(60, '{')

First list contains 3 additional elements.
First extra element 9009:
(4, '\r\n')

Diff is 150565 characters long. Set self.maxDiff to None to see it.

----------------------------------------------------------------------
Ran 1 test in 2.156s

and similarly on Windows x64 (free-threading) on test_xml_etree.py

First differing element 31519:
(60, "el .get ('NS')}")
(55, '{')

Second list contains 16 additional elements.
First extra element 32175:
(6, '')

Diff is 550642 characters long. Set self.maxDiff to None to see it.


terryjreedy commented Feb 10, 2024

Doc/Doctest failed with "python: ./Include/internal/pycore_typeobject.h:105: _PyType_GetModuleState: Assertion `et->ht_module' failed. Aborted (core dumped)". I don't know what this is.

Windows / build and test (x64) had the same failure on test_peepholer.py as I showed above for the same binary.
Windows (free-threading) / build and test (x64) had the same failure on test_xml_etree.py as I showed above for that binary.

I ran test_tokenize with free-threading debug and test_random_files failed with test_format.py, test_os.py, test_peepholer.py, and test_xml_etree.py.

EDIT: I ran test.test_tokenize directly, which makes support.is_resource_enabled("cpu") True, so all files are tested rather than a sample. Those 4 are therefore the only failures out of 387. (Time: about 4.5 minutes.)

The patch does fix the targeted f-string failure. With a complete fix, we will be able to remove this exclusion:

        # TODO: Remove this once we can untokenize PEP 701 syntax
        testfiles.remove(os.path.join(tempdir, "test_fstring.py"))
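The class of bug this PR targets can be shown with a small roundtrip check (a sketch, not taken from the PR's test suite; the particular f-string is an arbitrary example). On unpatched 3.12 interpreters untokenize mangled the `\N{...}` escape inside f-strings; with the fix the source survives the roundtrip exactly:

```python
from io import StringIO
from tokenize import generate_tokens, untokenize

# An f-string containing a \N{...} unicode named literal -- these braces
# are part of the escape sequence, not a replacement field.
src = 'x = f"\\N{GREEK SMALL LETTER ALPHA}"\n'
toks = list(generate_tokens(StringIO(src).readline))

# With full 5-tuples, untokenize should reproduce the source exactly.
roundtrip = untokenize(toks)
print(roundtrip == src)
```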

@terryjreedy

To determine where the patch initially fails in each file, I wrote this test program.

from tokenize import generate_tokens, untokenize

def test(file):
    toks = list(generate_tokens(file.readline))
    code = untokenize(toks)
    # Re-tokenize the untokenized output and compare token (type, string)
    # pairs; report the first mismatch with its positions and source line.
    readline = iter(code.splitlines(keepends=True)).__next__
    toks2 = list(generate_tokens(readline))
    print('tok lengths:', len(toks), len(toks2))
    for t1, t2 in zip(toks, toks2):
        if t1[0:2] != t2[0:2]:
            for t in t1, t2:
                print(f'{t[0]}, {t[1]}, {t[2]}, {t[3]}\n{t[4]}', end='')
            print()
            break

for name in ('format', 'os', 'peepholer', 'xml_etree'):
    filename = f'f:/dev/3x/Lib/test/test_{name}.py'
    print(filename)
    with open(filename) as f:
        test(f)

It prints

f:/dev/3x/Lib/test/test_format.py
tok lengths: 5100 5103
60, xx{, (533, 22), (533, 25)
                    f"xx{{value:{bad_format_spec}}}yy".format(value=value)
60, xx, (533, 22), (533, 24)
                    f"xx{  value:{bad_format_spec}}  yy".format(value=value)

f:/dev/3x/Lib/test/test_os.py
tok lengths: 39699 39700
60, environ({, (1063, 38), (1063, 47)
        self.assertEqual(repr(env), f"environ({{{formatted_items}}})")
60, environ(, (1063, 38), (1063, 46)
        self.assertEqual(repr(env), f"environ({  {formatted_items}}  )")

f:/dev/3x/Lib/test/test_peepholer.py
tok lengths: 8620 8624
60, {, (620, 24), (620, 25)
            pattern = f"{{'a': {a}, 'b': {b}, 'c': {c}}}"
55, {, (620, 24), (620, 25)
            pattern = f"{  'a': {a}, 'b': {b}, 'c': {c}}  "

f:/dev/3x/Lib/test/test_xml_etree.py
tok lengths: 31556 31558
60, {, (4144, 26), (4144, 27)
                        f"{{{el.get('NS')}}}{el.get('Name')}"
55, {, (4144, 26), (4144, 27)
                        f"{  {el.get('NS')}}  {el.get('Name')}"

The failures are all about '{{' in code becoming '{' when tokenized and untokenized.
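The doubled-brace failure mode can be reproduced in isolation (a minimal sketch modeled on the test_peepholer.py case above; on interpreters with the fix the roundtrip is exact):

```python
from io import StringIO
from tokenize import generate_tokens, untokenize

# '{{' and '}}' are escaped literal braces in an f-string; the buggy
# untokenize emitted them as single braces, changing the program.
source = "pattern = f\"{{'a': {1}}}\"\n"
toks = list(generate_tokens(StringIO(source).readline))
code = untokenize(toks)
print(code == source)
```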


pablogsal commented Feb 10, 2024

Thanks a lot @terryjreedy, this is extremely useful. I will try to adapt the patch today to cover this case, and I will add @serhiy-storchaka's test cases to the PR.

@pablogsal

I changed the strategy, but I think this should do the trick. Also, while I was at it, I added some fixes so that test_fstring.py roundtrips correctly.
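For context on what untokenize now has to handle (assuming Python 3.12+): PEP 701 splits an f-string into FSTRING_START / FSTRING_MIDDLE / FSTRING_END tokens, with ordinary tokens for the replacement fields, so brace escaping has to be re-applied when stitching the pieces back together. A quick way to inspect the stream:

```python
from io import StringIO
from tokenize import generate_tokens
import token

src = 'f"hi {name}"\n'
# Map each token's numeric type to its name for readability.
names = [token.tok_name[t.type] for t in generate_tokens(StringIO(src).readline)]
print(names)
# On 3.12+ this includes FSTRING_START, FSTRING_MIDDLE and FSTRING_END;
# older versions emit a single STRING token instead.
```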

pablogsal and others added 2 commits February 11, 2024 15:43
@terryjreedy

python -m test.test_tokenize, testing all 388 test_x modules (with test_fstring no longer excluded), passes on my machine.

@AlexWaygood

python -m test.test_tokenize, testing all 388 test_x modules (with test_fstring no longer excluded), passes on my machine.

It also passes for me, on a MacBook (I passed the -ucpu option so that test_random_files tests all files).

@pablogsal pablogsal enabled auto-merge (squash) February 12, 2024 16:00
@pablogsal pablogsal merged commit ecf16ee into python:main Feb 19, 2024
32 checks passed
@pablogsal pablogsal deleted the gh-115154 branch February 19, 2024 14:54
@miss-islington-app

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 19, 2024
…ythonGH-115171)

(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
bedevere-app bot commented Feb 19, 2024

GH-115662 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the "needs backport to 3.12" label Feb 19, 2024
pablogsal added a commit that referenced this pull request Feb 19, 2024
…H-115171) (#115662)

gh-115154: Fix untokenize handling of unicode named literals (GH-115171)
(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this pull request Mar 4, 2024