Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.12: tokenize adds a newline when it is not there #105259

Closed
asottile opened this issue Jun 2, 2023 · 8 comments
Closed

3.12: tokenize adds a newline when it is not there #105259

asottile opened this issue Jun 2, 2023 · 8 comments
Assignees
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes release-blocker type-bug An unexpected behavior, bug, or error

Comments

@asottile
Copy link
Contributor

asottile commented Jun 2, 2023

Bug report

tokenize adds a concrete newline character when it is not present. this breaks any sort of roundtrip source generation and pycodestyle's end-of-file checker

here's an example of a file containing a single byte (generated via echo -n 1 > t.py)

# hd -c t.py 
00000000  31                                                |1|
0000000   1                                                            
0000001
# python3.12 -m tokenize t.py
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,1:            NUMBER         '1'            
1,1-1,2:            NEWLINE        '\n'           
2,0-2,0:            ENDMARKER      ''             
# python3.11 -m tokenize t.py
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,1:            NUMBER         '1'            
1,1-1,2:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      '' 

Your environment

  • CPython versions tested on: 3.12 dbd7d7c
  • Operating system and architecture: ubuntu 22.04 LTS x86_64

cc @pablogsal

Linked PRs

@asottile asottile added the type-bug An unexpected behavior, bug, or error label Jun 2, 2023
@pablogsal
Copy link
Member

pablogsal commented Jun 3, 2023

@lysnikolaou can you take a look at this?

I think this is due to this line:

/* Last line does not end in \n, fake one */

in combination with this line:

str = PyUnicode_FromString("\n");

But the real question is what is going to break if we cover the first line with if (!tok->extra_tokens) and how we can fix it.~

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes 3.13 bugs and security fixes labels Jun 3, 2023
@pablogsal
Copy link
Member

pablogsal commented Jun 3, 2023

@lysnikolaou The problem is that we are always adding a new line at the end of the line buffer so we cannot distinguish if the last one is there for real or not (also, we always hardcode \n for NEWLINE tokens which we should not do for the last one).

@lysnikolaou
Copy link
Contributor

lysnikolaou commented Jun 5, 2023

The problem is that we are always adding a new line at the end of the line buffer so we cannot distinguish if the last one is there for real or not

Do we? I thought you had removed this in #104850, but I guess I misunderstood that PR. I'll have a look.

@pablogsal
Copy link
Member

pablogsal commented Jun 5, 2023

Do we? I thought you had removed this in #104850, but I guess I misunderstood that PR. I'll have a look.

That was "reverted" because apparently that was required to keep backward compatibility IIRC. But this is also a different kind of "new line": the tokenize module needs new lines at the end of every logical line to correctly emit the last tokens. Just comment out this part and see what fails (locations of last tokens are wrong as well as the some implicit dedents are missing):

/* Last line does not end in \n, fake one */

Notice that just removing that is not enough to fix this: we would also need to emit the last artificial newline if the input doesn't contain it.

@pablogsal
Copy link
Member

@Yhg1s can you wait until we fix this for the release?

@Yhg1s
Copy link
Member

Yhg1s commented Jun 6, 2023

Yeah, I can wait.

pablogsal added a commit to pablogsal/cpython that referenced this issue Jun 6, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 6, 2023
… NEWLINE tokens (pythonGH-105364)

(cherry picked from commit c0a6ed3)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
@pablogsal
Copy link
Member

@asottile can you confirm this fixes it?

@asottile
Copy link
Contributor Author

asottile commented Jun 6, 2023

yep looks fixed -- this is the only outstanding breaking change I'm encountering: #105390

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes release-blocker type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

5 participants