gh-115154: Fix untokenize handling of unicode named literals #115171

Merged
merged 8 commits into python:main from gh-115154 on Feb 19, 2024

Conversation


@pablogsal pablogsal commented Feb 8, 2024

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
@terryjreedy

This failed twice on Windows x64 with the same random file (test_peepholer.py)

======================================================================
FAIL: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='D:\\a\\cpython\\cpython\\Lib\\test\\test_peepholer.py')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_tokenize.py", line 1927, in test_random_files
    self.check_roundtrip(f)
    ~~~~~~~~~~~~~~~~~~~~^^^
  File "D:\a\cpython\cpython\Lib\test\test_tokenize.py", line 1823, in check_roundtrip
    self.assertEqual(tokens2_from2, tokens2)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Lists differ: [(65,[68848 chars]'), (55, '{'), (3, "'a'"), (55, ':'), (60, ' '[54206 chars] '')] != [(65,[68848 chars]'), (60, '{'), (60, "'a': "), (55, '{'), (1, '[54177 chars] '')]

First differing element 5042:
(55, '{')
(60, '{')

First list contains 3 additional elements.
First extra element 9009:
(4, '\r\n')

Diff is 150565 characters long. Set self.maxDiff to None to see it.

----------------------------------------------------------------------
Ran 1 test in 2.156s

and similarly on Windows x64 (free-threading) on test_xml_etree.py

First differing element 31519:
(60, "el .get ('NS')}")
(55, '{')

Second list contains 16 additional elements.
First extra element 32175:
(6, '')

Diff is 550642 characters long. Set self.maxDiff to None to see it.


terryjreedy commented Feb 10, 2024

Doc/Doctest failed with "python: ./Include/internal/pycore_typeobject.h:105: _PyType_GetModuleState: Assertion `et->ht_module' failed. Aborted (core dumped)". I don't know what this is.

Windows / build and test (x64) had the same failure on test_peepholer.py as I showed above for the same binary.
Windows (free-threading) / build and test (x64) had the same failure on test_xml_etree.py as I showed above for that binary.

I ran test_tokenize with free-threading debug and test_random_files failed with test_format.py, test_os.py, test_peepholer.py, and test_xml_etree.py.

EDIT: I ran test.test_tokenize directly, which makes support.is_resource_enabled("cpu") True, so all files are tested rather than a sample. Those 4 are therefore the only failures out of 387. (Time: about 4.5 minutes.)

The patch does fix the targeted f-string failure. With a complete fix, we will be able to remove this exclusion:

        # TODO: Remove this once we can untokenize PEP 701 syntax
        testfiles.remove(os.path.join(tempdir, "test_fstring.py"))
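The class of bug this PR targets can be shown with a small roundtrip check (a sketch, not taken from the PR's test suite; the particular f-string is an arbitrary example). On unpatched 3.12 interpreters untokenize mangled the `\N{...}` escape inside f-strings; with the fix the source survives the roundtrip exactly:

```python
from io import StringIO
from tokenize import generate_tokens, untokenize

# An f-string containing a \N{...} unicode named literal -- these braces
# are part of the escape sequence, not a replacement field.
src = 'x = f"\\N{GREEK SMALL LETTER ALPHA}"\n'
toks = list(generate_tokens(StringIO(src).readline))

# With full 5-tuples, untokenize should reproduce the source exactly.
roundtrip = untokenize(toks)
print(roundtrip == src)
```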

@terryjreedy

To determine where the patch initially fails in each file, I wrote this test program.

from tokenize import generate_tokens, untokenize

def test(file):
    toks = list(generate_tokens(file.readline))
    code = untokenize(toks)
    # Re-tokenize the untokenized output and compare token (type, string)
    # pairs; report the first mismatch with its positions and source line.
    readline = iter(code.splitlines(keepends=True)).__next__
    toks2 = list(generate_tokens(readline))
    print('tok lengths:', len(toks), len(toks2))
    for t1, t2 in zip(toks, toks2):
        if t1[0:2] != t2[0:2]:
            for t in t1, t2:
                print(f'{t[0]}, {t[1]}, {t[2]}, {t[3]}\n{t[4]}', end='')
            print()
            break

for name in ('format', 'os', 'peepholer', 'xml_etree'):
    filename = f'f:/dev/3x/Lib/test/test_{name}.py'
    print(filename)
    with open(filename) as f:
        test(f)

It prints

f:/dev/3x/Lib/test/test_format.py
tok lengths: 5100 5103
60, xx{, (533, 22), (533, 25)
                    f"xx{{value:{bad_format_spec}}}yy".format(value=value)
60, xx, (533, 22), (533, 24)
                    f"xx{  value:{bad_format_spec}}  yy".format(value=value)

f:/dev/3x/Lib/test/test_os.py
tok lengths: 39699 39700
60, environ({, (1063, 38), (1063, 47)
        self.assertEqual(repr(env), f"environ({{{formatted_items}}})")
60, environ(, (1063, 38), (1063, 46)
        self.assertEqual(repr(env), f"environ({  {formatted_items}}  )")

f:/dev/3x/Lib/test/test_peepholer.py
tok lengths: 8620 8624
60, {, (620, 24), (620, 25)
            pattern = f"{{'a': {a}, 'b': {b}, 'c': {c}}}"
55, {, (620, 24), (620, 25)
            pattern = f"{  'a': {a}, 'b': {b}, 'c': {c}}  "

f:/dev/3x/Lib/test/test_xml_etree.py
tok lengths: 31556 31558
60, {, (4144, 26), (4144, 27)
                        f"{{{el.get('NS')}}}{el.get('Name')}"
55, {, (4144, 26), (4144, 27)
                        f"{  {el.get('NS')}}  {el.get('Name')}"

The failures are all about '{{' in code becoming '{' when tokenized and untokenized.
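The doubled-brace failure mode can be reproduced in isolation (a minimal sketch modeled on the test_peepholer.py case above; on interpreters with the fix the roundtrip is exact):

```python
from io import StringIO
from tokenize import generate_tokens, untokenize

# '{{' and '}}' are escaped literal braces in an f-string; the buggy
# untokenize emitted them as single braces, changing the program.
source = "pattern = f\"{{'a': {1}}}\"\n"
toks = list(generate_tokens(StringIO(source).readline))
code = untokenize(toks)
print(code == source)
```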


pablogsal commented Feb 10, 2024

Thanks a lot @terryjreedy, this is extremely useful. I will try to adapt the patch today to cover this case, and I will add @serhiy-storchaka's test cases to the PR.

@pablogsal

I changed the strategy, but I think this should do the trick. Also, while I was at it, I added some fixes so that test_fstring.py roundtrips correctly.
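For context on what untokenize now has to handle (assuming Python 3.12+): PEP 701 splits an f-string into FSTRING_START / FSTRING_MIDDLE / FSTRING_END tokens, with ordinary tokens for the replacement fields, so brace escaping has to be re-applied when stitching the pieces back together. A quick way to inspect the stream:

```python
from io import StringIO
from tokenize import generate_tokens
import token

src = 'f"hi {name}"\n'
# Map each token's numeric type to its name for readability.
names = [token.tok_name[t.type] for t in generate_tokens(StringIO(src).readline)]
print(names)
# On 3.12+ this includes FSTRING_START, FSTRING_MIDDLE and FSTRING_END;
# older versions emit a single STRING token instead.
```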

pablogsal and others added 2 commits February 11, 2024 15:43
@terryjreedy

python -m test.test_tokenize, testing all 388 test_x modules (with test_fstring no longer excluded), passes on my machine.

@AlexWaygood

python -m test.test_tokenize, testing all 388 test_x modules (with test_fstring no longer excluded), passes on my machine.

It also passes for me, on a MacBook (I passed the -ucpu option so that test_random_files tests all files).

@pablogsal pablogsal enabled auto-merge (squash) February 12, 2024 16:00
@pablogsal pablogsal merged commit ecf16ee into python:main Feb 19, 2024
32 checks passed
@pablogsal pablogsal deleted the gh-115154 branch February 19, 2024 14:54
@miss-islington-app

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 19, 2024
…ythonGH-115171)

(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
bedevere-app bot commented Feb 19, 2024

GH-115662 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the "needs backport to 3.12" label Feb 19, 2024
pablogsal added a commit that referenced this pull request Feb 19, 2024
…H-115171) (#115662)

gh-115154: Fix untokenize handling of unicode named literals (GH-115171)
(cherry picked from commit ecf16ee)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this pull request Mar 4, 2024