
Tokenize does not roundtrip {{ after \n #125008

Closed
wyattscarpenter opened this issue Oct 5, 2024 · 5 comments
Labels
3.12 (bugs and security fixes), 3.13 (bugs and security fixes), 3.14 (new features, bugs and security fixes), topic-parser, type-bug (an unexpected behavior, bug, or error)

Comments


wyattscarpenter commented Oct 5, 2024

Bug report

Bug description:

import tokenize, io
source_code = r'''
f"""{80 * '*'}\n{{test}}{{details}}{{test2}}\n{80 * '*'}"""
'''
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t,s) for t, s, *_ in tokens)
print(x)

Expected:


f"""{80 *'*'}\n{{test}}{{details}}{{test2}}\n{80 *'*'}"""

Got:


f"""{80 *'*'}\n{test}}{{details}}{{test2}}\n{80 *'*'}"""

Note the absence of a second { in the {{ after the \n, but in no other positions.

Unlike some other roundtrip failures of tokenize, some of which are minor infelicities, this one actually creates a syntactically invalid program on roundtrip, which is quite bad. You get a SyntaxError: f-string: single '}' is not allowed when trying to use the results.
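One way to check for this class of failure is to compile the roundtripped source. A minimal sketch; `roundtrip_compiles` is a hypothetical helper name, not part of tokenize's API:

```python
# Sketch: does a tokenize -> untokenize roundtrip still produce valid syntax?
import io
import tokenize

def roundtrip_compiles(source_code: str) -> bool:
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    result = tokenize.untokenize((t, s) for t, s, *_ in tokens)
    try:
        compile(result, "<roundtrip>", "exec")
    except SyntaxError:
        return False
    return True

# Doubled braces with no backslash in front roundtrip fine; on affected
# versions, r'f"\n{{test}}"' would return False here.
print(roundtrip_compiles('f"{{test}}"\n'))
```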

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

Linked PRs

wyattscarpenter added the type-bug label Oct 5, 2024

wyattscarpenter commented Oct 5, 2024

Furthermore, here is the output of the following code:

import tokenize, io
source_code = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
for t in tokens:
  print(t)

Output:

TokenInfo(type=61 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='\\n{', start=(1, 2), end=(1, 5), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='test}', start=(1, 6), end=(1, 11), line='f"\\n{{test}}"')
TokenInfo(type=63 (FSTRING_END), string='"', start=(1, 12), end=(1, 13), line='f"\\n{{test}}"')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 13), end=(1, 14), line='f"\\n{{test}}"')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

So, it seems that the line is getting in alright, but the \n{{ is getting turned into a \n{ in the tokenizer somehow.

Same erroneous output for the bytes version (with rb-string, BytesIO and tokenize.tokenize).
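For context, this tokenizer behaviour is expected under PEP 701: FSTRING_MIDDLE tokens carry the text with brace escaping already collapsed, so a literal {{ arrives as a single { and untokenize() is responsible for re-doubling it on output. A sketch reproducing the token dump above (the version guard is mine; on pre-3.12 interpreters the f-string is a single STRING token and the list is empty):

```python
# Sketch: collect the FSTRING_MIDDLE token strings for the same input
# as the token dump above, to show braces arrive already unescaped.
import io
import sys
import tokenize

source_code = 'f"\\n{{test}}"\n'  # i.e. r'f"\n{{test}}"'
middles = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source_code).readline)
    if tokenize.tok_name[tok.type] == "FSTRING_MIDDLE"
]
print(middles)
if sys.version_info >= (3, 12):
    # Braces are single in the token text even though doubled in the source.
    assert "".join(middles) == "\\n{test}"
```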

AlexWaygood (Member) commented

It looks like this was a regression in Python 3.12; I can't reproduce the behaviour with Python 3.11. I'm guessing it was caused by the PEP-701 changes.

AlexWaygood added the 3.12, 3.13, and 3.14 labels Oct 5, 2024
AlexWaygood (Member) commented

Reproduced on the main branch as well.


tomasr8 commented Oct 5, 2024

This seems to happen with other escape characters as well:

import tokenize, io
source_code = r'f"""\t{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t,s) for t, s, *_ in tokens)
print(x)  # f"""\t{test}}"""

import tokenize, io
source_code = r'f"""\r{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t,s) for t, s, *_ in tokens)
print(x)  # f"""\r{test}}"""
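The examples above can be probed generically. A small sketch (the `roundtrip` helper name is mine, not from the thread), assuming an interpreter with the fix applied, where every variant should roundtrip with its braces intact:

```python
# Sketch: roundtrip several escape characters in front of doubled braces.
import io
import tokenize

def roundtrip(source_code: str) -> str:
    tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
    return tokenize.untokenize((t, s) for t, s, *_ in tokens)

for esc in (r"\n", r"\t", r"\r"):
    src = 'f"""' + esc + '{{test}}"""\n'
    print(esc, repr(roundtrip(src)))
```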


tomasr8 commented Oct 5, 2024

I think the issue is in this method:

cpython/Lib/tokenize.py

Lines 187 to 208 in 16cd6cc

def escape_brackets(self, token):
    characters = []
    consume_until_next_bracket = False
    for character in token:
        if character == "}":
            if consume_until_next_bracket:
                consume_until_next_bracket = False
            else:
                characters.append(character)
        if character == "{":
            n_backslashes = sum(
                1 for char in _itertools.takewhile(
                    "\\".__eq__,
                    characters[-2::-1]
                )
            )
            if n_backslashes % 2 == 0:
                characters.append(character)
            else:
                consume_until_next_bracket = True
        characters.append(character)
    return "".join(characters)

This PR fixed the handling of unicode-name escapes (e.g. \N{foo}), but the current code only checks for the presence of backslashes before the bracket without checking whether they are followed by an N. This appears to fix that:

            if character == "{":
                n_backslashes = sum(
                    1 for char in _itertools.takewhile(
                        "\\".__eq__,
                        characters[-2::-1]
                    )
                )
-               if n_backslashes % 2 == 0:
+               if n_backslashes % 2 == 0 or characters[-1] != "N":
                    characters.append(character)
                else:
                    consume_until_next_bracket = True
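To make the difference concrete, here is a standalone sketch of the buggy and patched logic side by side (function names are mine; in CPython this is the Untokenizer.escape_brackets method). The input is FSTRING_MIDDLE token text, where a literal {{ has already been collapsed to { by the tokenizer:

```python
# Sketch: buggy vs. patched brace re-escaping, extracted from the
# Untokenizer method for illustration.
import itertools

def escape_brackets_buggy(token: str) -> str:
    characters = []
    consume_until_next_bracket = False
    for character in token:
        if character == "}":
            if consume_until_next_bracket:
                consume_until_next_bracket = False
            else:
                characters.append(character)  # extra "}" to re-double it
        if character == "{":
            n_backslashes = sum(
                1 for char in itertools.takewhile(
                    "\\".__eq__, characters[-2::-1]
                )
            )
            # Bug: ANY odd backslash count suppresses the re-doubling,
            # even when the escape is \n, \t, ... rather than \N{...}.
            if n_backslashes % 2 == 0:
                characters.append(character)  # extra "{" to re-double it
            else:
                consume_until_next_bracket = True
        characters.append(character)
    return "".join(characters)

def escape_brackets_fixed(token: str) -> str:
    characters = []
    consume_until_next_bracket = False
    for character in token:
        if character == "}":
            if consume_until_next_bracket:
                consume_until_next_bracket = False
            else:
                characters.append(character)
        if character == "{":
            n_backslashes = sum(
                1 for char in itertools.takewhile(
                    "\\".__eq__, characters[-2::-1]
                )
            )
            # Only a \N{...} unicode-name escape should keep a single "{".
            if n_backslashes % 2 == 0 or characters[-1] != "N":
                characters.append(character)
            else:
                consume_until_next_bracket = True
        characters.append(character)
    return "".join(characters)

print(escape_brackets_buggy("\\n{"))   # brace not re-doubled: the bug
print(escape_brackets_fixed("\\n{"))   # brace correctly re-doubled
print(escape_brackets_fixed("\\N{GREEK SMALL LETTER ALPHA}"))  # unchanged
```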

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 6, 2024
…onGH-125013)

(cherry picked from commit db23b8b)

Co-authored-by: Tomas R. <tomas.roun8@gmail.com>
wyattscarpenter added a commit to wyattscarpenter/mypy that referenced this issue Oct 7, 2024