Skip to content

gh-72680: Fix false positives when using zipfile.is_zipfile() #5053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jjolly
Copy link
Contributor

@jjolly jjolly commented Dec 30, 2017

Fix zipfile validation issue by ... providing more validation!

Originally, zipfile.is_zipfile() only checked the End Central Directory
signature. If the signature could be found in the last 64k of the file,
success! This produced false positives on any file with 'PK\x05\x06' in the
last 64k of the file - including PDFs and PNGs.

This is now corrected by actually validating the Central Directory location
and size based on the information provided by the End Central Directory
along with verifying the Central Directory signature of the first entry.

This should be sufficient for the vast number of zipfiles, but more could be
done to absolutely validate the zipfile content - such as validating all
local file headers and Central Directory entries.

https://bugs.python.org/issue28494

@gpshead
Copy link
Member

gpshead commented Jan 30, 2019

Can you expand the unit testing for is_zipfile? is_zipfile should effectively suggest that zipfile.ZipFile will be able to read the archive and in particular a False return value should always mean that zipfile.ZipFile would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in the test_zipfile.py and test_zipfile64.py today.

It would not be unreasonable to assertTrue and assertFalse as appropriate the result of is_zipfile() on any test .zip file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.

The zipfile.is_zipfile function would only search for the EndOfZipfile
section header. This failed to correctly identify non-zipfiles that
contained this header. Now the zipfile.is_zipfile function verifies
the first central directory entry.

Changes:
* Extended zipfile.is_zipfile to verify zipfile catalog
* Added tests to validate failure of binary non-zipfiles
@jjolly jjolly force-pushed the is_zipfile_verify branch from 2c1b3c3 to 2da9363 Compare June 1, 2019 20:56
@jjolly
Copy link
Contributor Author

jjolly commented Jun 2, 2019

Can you expand the unit testing for is_zipfile? is_zipfile should effectively suggest that zipfile.ZipFile will be able to read the archive and in particular a False return value should always mean that zipfile.ZipFile would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in the test_zipfile.py and test_zipfile64.py today.

I've added a PNG file that would succeed under the old algorithm, but fails correctly with the new algorithm. Perhaps the PNG is overkill, but it's proof that a valid format for another file should fail as a ZIP archive.

It would not be unreasonable to assertTrue and assertFalse as appropriate the result of is_zipfile() on any test .zip file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.

I've added a few checks throughout the test_zipfile.py script. If you see others you would like to check, let me know and I can add them.

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 30, 2019
Summary:
The default implementation is lenient in that it recognizes a zipfile if the magic number appears anywhere in the archive. So if someone has the bytes `PK\x03\x04` in a tensor, it gets recognized as a zipfile. See https://bugs.python.org/issue28494

This implementation only checks the first 4 bytes of the file for the zip magic number. We could also copy python/cpython#5053 fix, but that seems like overkill.

Fixes #25214
](https://our.intern.facebook.com/intern/diff/17102516/)
Pull Request resolved: #25279

Pulled By: driazati

Differential Revision: D17102516

fbshipit-source-id: 4d09645bd97e9ff7136a2229fba1d9a1bce5665a
@gpshead gpshead self-assigned this Sep 9, 2019
@gpshead
Copy link
Member

gpshead commented Sep 9, 2019

FYI - This PR as is causes zipfile.is_zipfile to start rejecting some valid zipfiles that zipfile.ZipFile() happily opens and properly processes. I'm looking into why and how to best update the test suite to reveal this...

@gpshead gpshead added type-feature A feature request or enhancement stale Stale PR or inactive for long period of time. labels Apr 20, 2022
@gpshead gpshead removed their assignment Apr 20, 2022
@gpshead gpshead self-requested a review April 20, 2022 22:58
@thomaswaldmann thomaswaldmann mannequin mentioned this pull request Apr 10, 2022
@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Jul 29, 2022
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Feb 23, 2023
@arhadthedev
Copy link
Member

I'm looking into why and how to best update the test suite to reveal this...

@gpshead ping

@arhadthedev arhadthedev changed the title bpo-28494: Fix false positives when using zipfile.is_zipfile() gh-72680: Fix false positives when using zipfile.is_zipfile() Feb 25, 2023
@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Feb 26, 2023
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review DO-NOT-MERGE stale Stale PR or inactive for long period of time. type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants