
Tokenizer produces different output on Windows on py312 for ends of files #105017

Closed
AlexWaygood opened this issue May 27, 2023 · 6 comments

Labels: 3.12 bugs and security fixes · 3.13 bugs and security fixes · OS-windows · type-bug An unexpected behavior, bug, or error

Comments

AlexWaygood (Member) commented May 27, 2023

Bug report

If you copy and paste the following code into a repro.py file (ending with a single newline) and run python -m tokenize on it on a Windows machine, the output on 3.12/3.13 differs from what it was on 3.11:

foo = 'bar'
spam = 'eggs'
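
For reference, the same token stream can be inspected programmatically with the tokenize module. A minimal sketch, assuming repro.py (with CRLF line endings) is in the current directory:

import tokenize

# Open the file in binary mode so the tokenizer sees the raw CRLF line
# endings and detects the encoding itself, mirroring `python -m tokenize`.
with open("repro.py", "rb") as f:
    for tok in tokenize.tokenize(f.readline):
        print(tok)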

On Python 3.11, on Windows, the output is this:

> python -m tokenize cpython/repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,13:          NEWLINE        '\r\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,15:          NEWLINE        '\r\n'
3,0-3,0:            ENDMARKER      ''

On Python 3.13 (@ 6e62eb2) on Windows, however, the output is this:

> python -m tokenize repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,14:          NEWLINE        '\n'
3,0-3,1:            NL             '\n'
4,0-4,0:            ENDMARKER      ''

There appear to be two changes here:

  • All the NEWLINE tokens now have \n values, whereas on Python 3.11 they all had \r\n values
  • There is an additional NL token at the end, immediately before the ENDMARKER token.

As discussed in PyCQA/pycodestyle#1142, this appears to be Windows-specific, and may be the cause of a large number of spurious W391 errors from the pycodestyle linting tool. (W391 dictates that there should be one, and only one, newline at the end of a file.) The pycodestyle tool is included in flake8, meaning that test failures in pycodestyle can cause test failures for other flake8 plugins. (All tests for the flake8-pyi plugin, for example, currently fail on Python 3.13 on Windows.)
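
To make the connection to W391 concrete, a token-based check along the following lines (a rough, hypothetical illustration, not pycodestyle's actual implementation) would treat the extra NL token before ENDMARKER as a trailing blank line:

import io
import tokenize

# Hypothetical check in the spirit of W391: flag an NL token sitting
# immediately before ENDMARKER as a blank line at the end of the file.
def has_trailing_blank_line(source: bytes) -> bool:
    toks = list(tokenize.tokenize(io.BytesIO(source).readline))
    return len(toks) >= 2 and toks[-2].type == tokenize.NL

# CRLF-terminated sample matching the repro above.
src = b"foo = 'bar'\r\nspam = 'eggs'\r\n"
# Expected: False on 3.11, True on the affected 3.12/3.13 builds that
# emit the spurious trailing NL token.
print(has_trailing_blank_line(src))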

Your environment

Python 3.13.0a0 (heads/main:6e62eb2e70, May 27 2023, 14:00:13) [MSC v.1932 64 bit (AMD64)] on win32

AlexWaygood added the type-bug, OS-windows, 3.12 and 3.13 labels on May 27, 2023
AlexWaygood changed the title from "Tokenizer produces different output on Windows for ends of files" to "Tokenizer produces different output on Windows on py312 for ends of files" on May 27, 2023
pablogsal (Member) commented:

CC: @mgmacias95

AlexWaygood (Member, Author) commented May 27, 2023

#105022 fixes the issue of the additional NL token at the end of the file on Windows, but there are still differences on Windows between what the tokenizer produced on 3.11 and what it produces with #105022 applied.

On 3.11 on Windows:

> python -m tokenize cpython/repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,13:          NEWLINE        '\r\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,15:          NEWLINE        '\r\n'
3,0-3,0:            ENDMARKER      ''

Using #105022:

> python -m tokenize repro.py
0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'foo'
1,4-1,5:            OP             '='
1,6-1,11:           STRING         "'bar'"
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            NAME           'spam'
2,5-2,6:            OP             '='
2,7-2,13:           STRING         "'eggs'"
2,13-2,14:          NEWLINE        '\n'
3,0-3,0:            ENDMARKER      ''

Note that the same sequence of token types is now emitted as on 3.11, but each NEWLINE token's string is still '\n' rather than '\r\n', and its end-column coordinate is off by one compared to 3.11.

(I'm still seeing loads of flake8-pyi test failures even using #105022, and I'm presuming this is the cause?)
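
For anyone who wants to spot-check this without the CLI, here is a small sketch (assuming a CRLF-terminated repro.py in the current directory) that prints only the NEWLINE tokens for comparison against the 3.11 output above:

import tokenize

# Print just the NEWLINE tokens; on 3.11 with a CRLF file these are
# '\r\n' ending at columns 13 and 15, whereas the output above shows
# '\n' ending at columns 12 and 14.
with open("repro.py", "rb") as f:
    for tok in tokenize.tokenize(f.readline):
        if tok.type == tokenize.NEWLINE:
            print(repr(tok.string), tok.start, tok.end)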

pablogsal added a commit that referenced this issue May 27, 2023
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 27, 2023
…ythonGH-105022)

(cherry picked from commit 86d8f48)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue May 27, 2023
…H-105022) (#105023)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal (Member) commented May 28, 2023

@AlexWaygood can you check again with the new PR (#105030)?

pablogsal added a commit that referenced this issue May 28, 2023
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 28, 2023
…thonGH-105030)

(cherry picked from commit 96fff35)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit that referenced this issue May 28, 2023
…H-105030) (#105041)

gh-105017: Include CRLF lines in strings and column numbers (GH-105030)
(cherry picked from commit 96fff35)

Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
AlexWaygood (Member, Author) commented:

> @AlexWaygood can you check again with the new PR (#105030)?

Will check either this evening or tomorrow

AlexWaygood (Member, Author) commented:

Well, I have good news and bad news.

The good news is that, on Windows, the tokenize module appears to now be producing exactly the same output on main as it did on 3.11.

The bad news is that the spurious pycodestyle W391 errors discussed in PyCQA/pycodestyle#1142 still haven't gone away, even using CPython main. These definitely bisect to 6715f91, so something else must be going on to cause these failures :(

Regardless, the original bug reported here has definitely been fixed, so I'll close this for now, and I'll open a new issue once I've done some more investigation into what might be causing the spurious W391 errors and (hopefully) found a minimal repro that doesn't involve third-party code. Thanks @pablogsal and @mgmacias95 for all your hard work on this!

AlexWaygood (Member, Author) commented:

> I'll open a new issue once I've done some more investigation into what might be causing the spurious W391 errors and (hopefully) found a minimal repro that doesn't involve third-party code. Thanks @pablogsal and @mgmacias95 for all your hard work on this!

I've tried to investigate, but I can't immediately figure out what might be causing these W391 errors. I suspect it'll need somebody more familiar with pycodestyle's architecture and/or the tokenize module to figure out whether this is, in fact, a CPython bug at all -- and if it is, to produce a minimal repro that doesn't depend on pycodestyle.
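
One possible starting point for a pycodestyle-free repro (just a sketch; the CRLF sample source is an assumption based on the repro above) is to dump the token stream for the same source on 3.11 and on main and diff the two outputs:

import io
import tokenize

# Dump the token stream in a diff-friendly form; run this under 3.11 and
# under main, then compare the outputs to see whether tokenize itself
# still behaves differently for CRLF-terminated source.
src = b"foo = 'bar'\r\nspam = 'eggs'\r\n"
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)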
