
Tokenizer produces different output on Windows on py312 for ends of files #105017

Closed
AlexWaygood opened this issue May 27, 2023 · 6 comments

Labels: 3.12 bugs and security fixes · 3.13 bugs and security fixes · OS-windows · type-bug An unexpected behavior, bug, or error

Comments

AlexWaygood (Member) commented May 27, 2023

Bug report

If you copy and paste the following code into a repro.py file (ending with a single newline) and run python -m tokenize on it on a Windows machine, the output on 3.12/3.13 differs from what it was on 3.11:

foo = 'bar'
spam = 'eggs'
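
For reference, the same token stream can be inspected programmatically with the tokenize module. A minimal sketch, assuming repro.py (with CRLF line endings) is in the current directory:

import tokenize

# Open the file in binary mode so the tokenizer sees the raw CRLF line
# endings and detects the encoding itself, mirroring `python -m tokenize`.
with open("repro.py", "rb") as f:
    for tok in tokenize.tokenize(f.readline):
        print(tok)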

On Python 3.11, on Windows, the output is this:

> python -m tokenize cpython/repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,13:          NEWLINE        '\r\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,15:          NEWLINE        '\r\n'
3,0-3,0:            ENDMARKER      ''

On Python 3.13 (@ 6e62eb2) on Windows, however, the output is this:

> python -m tokenize repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,14:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            ENDMARKER      ''

There appear to be two changes here:

  • All the NEWLINE tokens now have \n values, whereas on Python 3.11 they all had \r\n values
  • There is an additional NL token at the end, immediately before the ENDMARKER token.

As discussed in PyCQA/pycodestyle#1142, this appears to be Windows-specific, and may be the cause of a large number of spurious W391 errors from the pycodestyle linting tool. (W391 dictates that there should be one, and only one, newline at the end of a file.) The pycodestyle tool is included in flake8, meaning that test failures in pycodestyle can cause test failures for other flake8 plugins. (All tests for the flake8-pyi plugin, for example, currently fail on Python 3.13 on Windows.)
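
To make the connection to W391 concrete, a token-based check along the following lines (a rough, hypothetical illustration, not pycodestyle's actual implementation) would treat the extra NL token before ENDMARKER as a trailing blank line:

import io
import tokenize

# Hypothetical check in the spirit of W391: flag an NL token sitting
# immediately before ENDMARKER as a blank line at the end of the file.
def has_trailing_blank_line(source: bytes) -> bool:
    toks = list(tokenize.tokenize(io.BytesIO(source).readline))
    return len(toks) >= 2 and toks[-2].type == tokenize.NL

# CRLF-terminated sample matching the repro above.
src = b"foo = 'bar'\r\nspam = 'eggs'\r\n"
# Expected: False on 3.11, True on the affected 3.12/3.13 builds that
# emit the spurious trailing NL token.
print(has_trailing_blank_line(src))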

Your environment

Python 3.13.0a0 (heads/main:6e62eb2e70, May 27 2023, 14:00:13) [MSC v.1932 64 bit (AMD64)] on win32

AlexWaygood added the type-bug, OS-windows, 3.12 and 3.13 labels on May 27, 2023
AlexWaygood changed the title from "Tokenizer produces different output on Windows for ends of files" to "Tokenizer produces different output on Windows on py312 for ends of files" on May 27, 2023
pablogsal (Member) commented:

CC: @mgmacias95

AlexWaygood (Member, Author) commented May 27, 2023

#105022 fixes the issue of the additional NL token at the end of the file on Windows, but there are still differences on Windows between what the tokenizer produced on 3.11 and what it produces with #105022 applied.

On 3.11 on Windows:

> python -m tokenize cpython/repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,13:          NEWLINE        '\r\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,15:          NEWLINE        '\r\n'
3,0-3,0:            ENDMARKER      ''

Using #105022:

> python -m tokenize repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,14:          NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''

Note that the same sequence of token types is now emitted as on 3.11, but each NEWLINE token's string is still '\n' rather than '\r\n', and its end-column coordinate is off by one compared to 3.11.

(I'm still seeing loads of flake8-pyi test failures even using #105022, and I'm presuming this is the cause?)
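
For anyone who wants to spot-check this without the CLI, here is a small sketch (assuming a CRLF-terminated repro.py in the current directory) that prints only the NEWLINE tokens for comparison against the 3.11 output above:

import tokenize

# Print just the NEWLINE tokens; on 3.11 with a CRLF file these are
# '\r\n' ending at columns 13 and 15, whereas the output above shows
# '\n' ending at columns 12 and 14.
with open("repro.py", "rb") as f:
    for tok in tokenize.tokenize(f.readline):
        if tok.type == tokenize.NEWLINE:
            print(repr(tok.string), tok.start, tok.end)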

pablogsal added a commit that referenced this issue May 27, 2023
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 27, 2023
…ythonGH-105022)

(cherry picked from commit 86d8f48)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue May 27, 2023
…H-105022) (#105023)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal (Member) commented May 28, 2023

@AlexWaygood can you check again with the new PR (#105030)?

pablogsal added a commit that referenced this issue May 28, 2023
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 28, 2023
…thonGH-105030)

(cherry picked from commit 96fff35)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit that referenced this issue May 28, 2023
…H-105030) (#105041)

gh-105017: Include CRLF lines in strings and column numbers (GH-105030)
(cherry picked from commit 96fff35)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
AlexWaygood (Member, Author) commented:

> @AlexWaygood can you check again with the new PR (#105030)?

Will check either this evening or tomorrow

AlexWaygood (Member, Author) commented:

Well, I have good news and bad news.

The good news is that, on Windows, the tokenize module appears to now be producing exactly the same output on main as it did on 3.11.

The bad news is that the spurious pycodestyle W391 errors discussed in PyCQA/pycodestyle#1142 still haven't gone away, even using CPython main. These definitely bisect to 6715f91, so something else must be going on to cause these failures :(

Regardless, the original bug reported here has definitely been fixed, so I'll close this for now, and I'll open a new issue once I've done some more investigation into what might be causing the spurious W391 errors and (hopefully) found a minimal repro that doesn't involve third-party code. Thanks @pablogsal and @mgmacias95 for all your hard work on this!

AlexWaygood (Member, Author) commented:

> I'll open a new issue once I've done some more investigation into what might be causing the spurious W391 errors and (hopefully) found a minimal repro that doesn't involve third-party code. Thanks @pablogsal and @mgmacias95 for all your hard work on this!

I've tried to investigate, but I can't immediately figure out what might be causing these W391 errors. I suspect it'll need somebody more familiar with pycodestyle's architecture and/or the tokenize module to figure out whether this is, in fact, a CPython bug at all -- and if it is, to produce a minimal repro that doesn't depend on pycodestyle.
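
One possible starting point for a pycodestyle-free repro (just a sketch; the CRLF sample source is an assumption based on the repro above) is to dump the token stream for the same source on 3.11 and on main and diff the two outputs:

import io
import tokenize

# Dump the token stream in a diff-friendly form; run this under 3.11 and
# under main, then compare the outputs to see whether tokenize itself
# still behaves differently for CRLF-terminated source.
src = b"foo = 'bar'\r\nspam = 'eggs'\r\n"
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)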
