-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-126834: Properly read zip64 archives with non-empty zip64 extensible data sector in Zip64 end of central directory record #126841
base: main
Are you sure you want to change the base?
Conversation
…tor in Zip64 end of central directory record Fixes python#126834
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Lib/zipfile/__init__.py
Outdated
@@ -270,8 +270,7 @@ def _EndRecData64(fpin, offset, endrec): | |||
if diskno != 0 or disks > 1: | |||
raise BadZipFile("zipfiles that span multiple disks are not supported") | |||
|
|||
# Assume no 'zip64 extensible data' | |||
fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2) | |||
fpin.seek(reloff, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think struct.unpack
returns relative to the struct, and that will be at offset - sizeEndCentDir64Locator
(where it was read from). SEEK_SET / 0 isn't right as reloff
is a relative position from where the struct was read, not an absolute position in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the variable is named reloff
and the spec states "relative offset of the zip64 end of central directory record" without specifying relative to what, in reality it's offset from the beginning of the file. See code which writes it
cpython/Lib/zipfile/__init__.py
Line 2078 in 2313f84
stringEndArchive64Locator, 0, pos2, 1) |
Also, the same in the libzip code: https://github.com/nih-at/libzip/blob/d0ebf7fa268ae2e59e575cb3a72e6bc259e3fdd8/lib/zip_open.c#L853
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmaloney wdyt on renaming reloff
, worth changing to e.g. eocd_offset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely would be clearer to me, but not sure it's worth the extra noise in the diff though / additional lines changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offsets are complicated by data that may precede the start of the zip content which is why some of the tests are failing. reloff
is the number of bytes from the start of the first local file header in the zip which may not be the actual start of the file. I can't think of a good way of directly computing the start of the zip64 end of record block if there's data preceding the start of the zip part of the file. Might have to do a bit of a search.
# Assume no 'zip64 extensible data'
fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2)
data = fpin.read(sizeEndCentDir64)
if len(data) != sizeEndCentDir64:
return endrec
sig, sz, create_version, read_version, disk_num, disk_dir, \
dircount, dircount2, dirsize, diroffset = \
struct.unpack(structEndArchive64, data)
if sig != stringEndArchive64:
loc_pos = fpin.tell()
# Seek to the earliest possible eocd64 start
fpin.seek(reloff, 0)
# Read all the data between here and the eocd64 locator
data = fpin.read(loc_pos - reloff)
start = data.rfind(stringEndArchive64)
if start >= 0 and len(data) - start > sizeEndCentDir64:
sig, sz, create_version, read_version, disk_num, disk_dir, \
dircount, dircount2, dirsize, diroffset = \
struct.unpack(structEndArchive64, data[start:start+sizeEndCentDir64])
if sig != stringEndArchive64:
return endrec
else:
return endrec
My code needs improvement as data = fpin.read(loc_pos - reloff)
might read a substantial amount of data if there's a big blob of data before the zip. It would also be a good idea to check that the sizes of the offsets are consistent with regards to:
- the zip64 end locator is actually
sz
bytes from the position just after thesz
field - The position of the
stringEndArchive64
signature found usingrfind
is the same position as thestringEndArchive64
signature that is found directly after the central directory.
…ata sector in Zip64 end of central directory record
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
This will need a news entry describing the feature it's adding, the CI / PR testing failures look related to the code changes and need to get resolved |
Fixes #126834