gh-126834: Properly read zip64 archives with non-empty zip64 extensible data sector in Zip64 end of central directory record #126841

VladRassokhin · 2024-11-14T20:55:54Z

Fixes #126834

Issue: Cannot open zip64 file if Zip64EOCD record has additional data #126834

…tor in Zip64 end of central directory record Fixes python#126834

cpython-cla-bot · 2024-11-14T20:55:57Z

All commit authors signed the Contributor License Agreement.

bedevere-app · 2024-11-14T20:55:58Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

cmaloney · 2024-11-14T20:59:48Z

Lib/zipfile/__init__.py

@@ -270,8 +270,7 @@ def _EndRecData64(fpin, offset, endrec):
    if diskno != 0 or disks > 1:
        raise BadZipFile("zipfiles that span multiple disks are not supported")

-    # Assume no 'zip64 extensible data'
-    fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2)
+    fpin.seek(reloff, 0)


I think the value struct.unpack returns is relative to the struct, which was read at offset - sizeEndCentDir64Locator. SEEK_SET / 0 isn't right here: reloff is a position relative to where the struct was read, not an absolute position in the file.

VladRassokhin · Nov 16, 2024

Although the variable is named reloff and the spec says "relative offset of the zip64 end of central directory record" without saying relative to what, in practice it is the offset from the beginning of the file. See the code that writes it:

cpython/Lib/zipfile/__init__.py

Line 2078 in 2313f84

stringEndArchive64Locator, 0, pos2, 1)

Also, the same in the libzip code: https://github.com/nih-at/libzip/blob/d0ebf7fa268ae2e59e575cb3a72e6bc259e3fdd8/lib/zip_open.c#L853
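To make the point about absolute offsets concrete, here is a minimal sketch that builds a synthetic archive tail (not a complete zip file) and compares the two seek strategies from the diff. The record layouts and signatures follow the ZIP APPNOTE; the helper names build_archive, signature_via_locator and signature_via_fixed_size are hypothetical, and the sketch assumes no archive comment and no data before the zip.

```python
import io
import struct

# Layouts and signatures from the ZIP APPNOTE; zipfile uses the same formats.
EOCD64_FMT = "<4sQ2H2L4Q"   # zip64 end of central directory record (56 bytes)
LOCATOR_FMT = "<4sLQL"      # zip64 end of central directory locator (20 bytes)
EOCD_FMT = "<4s4H2LH"       # end of central directory record (22 bytes)
EOCD64_SIG, LOCATOR_SIG, EOCD_SIG = b"PK\x06\x06", b"PK\x06\x07", b"PK\x05\x06"


def build_archive(body: bytes, extensible: bytes) -> bytes:
    """Hypothetical helper: body + EOCD64 + extensible data + locator + EOCD."""
    eocd64_pos = len(body)
    # sz excludes the 12 leading bytes (signature and the sz field itself).
    sz = struct.calcsize(EOCD64_FMT) - 12 + len(extensible)
    eocd64 = struct.pack(EOCD64_FMT, EOCD64_SIG, sz, 45, 45, 0, 0, 0, 0, 0, eocd64_pos)
    locator = struct.pack(LOCATOR_FMT, LOCATOR_SIG, 0, eocd64_pos, 1)
    eocd = struct.pack(EOCD_FMT, EOCD_SIG, 0, 0, 0xFFFF, 0xFFFF,
                       0xFFFFFFFF, 0xFFFFFFFF, 0)
    return body + eocd64 + extensible + locator + eocd


def signature_via_locator(fp) -> bytes:
    """Treat the locator's stored offset as an absolute file position."""
    tail = struct.calcsize(EOCD_FMT) + struct.calcsize(LOCATOR_FMT)
    fp.seek(-tail, 2)
    sig, _diskno, reloff, _disks = struct.unpack(
        LOCATOR_FMT, fp.read(struct.calcsize(LOCATOR_FMT)))
    if sig != LOCATOR_SIG:
        raise ValueError("no zip64 locator before the end record")
    fp.seek(reloff, 0)
    return fp.read(4)


def signature_via_fixed_size(fp) -> bytes:
    """Pre-patch approach: assume the EOCD64 record is exactly 56 bytes long."""
    tail = (struct.calcsize(EOCD_FMT) + struct.calcsize(LOCATOR_FMT)
            + struct.calcsize(EOCD64_FMT))
    fp.seek(-tail, 2)
    return fp.read(4)


blob = build_archive(b"\x00" * 100, b"extensible!")
print(signature_via_locator(io.BytesIO(blob)) == EOCD64_SIG)     # True
print(signature_via_fixed_size(io.BytesIO(blob)) == EOCD64_SIG)  # False
```

With empty extensible data both strategies land on the signature. They only diverge once the record grows beyond its fixed 56 bytes.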

VladRassokhin · Nov 17, 2024

@cmaloney wdyt about renaming reloff? Is it worth changing to e.g. eocd_offset?

cmaloney · Nov 17, 2024

Definitely would be clearer to me, but I'm not sure it's worth the extra noise in the diff / additional lines changed.

danifus · Nov 19, 2024

The offsets are complicated by data that may precede the start of the zip content, which is why some of the tests are failing. reloff is the number of bytes from the start of the first local file header in the zip, which may not be the actual start of the file. I can't think of a good way to directly compute the start of the zip64 end of central directory record if there's data preceding the zip part of the file. Might have to do a bit of a search.
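The effect of prepended data can be sketched with synthetic bytes. The example below assumes the stored offsets are not rewritten when a prefix is added (as the comment describes), and uses a plain backward signature search only because the data is synthetic; stored_and_actual is a hypothetical helper name.

```python
import struct

LOCATOR_FMT = "<4sLQL"      # zip64 end of central directory locator
EOCD64_FMT = "<4sQ2H2L4Q"   # zip64 end of central directory record
EOCD64_SIG, LOCATOR_SIG = b"PK\x06\x06", b"PK\x06\x07"


def stored_and_actual(blob: bytes) -> tuple[int, int]:
    """Return (offset stored in the locator, real offset of the EOCD64 record).

    Uses a plain backward signature search, which is fine for this synthetic
    data but could hit false positives in a real file.
    """
    loc_pos = blob.rfind(LOCATOR_SIG)
    _sig, _diskno, reloff, _disks = struct.unpack_from(LOCATOR_FMT, blob, loc_pos)
    return reloff, blob.rfind(EOCD64_SIG, 0, loc_pos)


# Synthetic archive: 64 bytes of "zip content", then EOCD64 and its locator.
content = b"\x00" * 64
eocd64 = struct.pack(EOCD64_FMT, EOCD64_SIG, 44, 45, 45, 0, 0, 0, 0, 0, len(content))
locator = struct.pack(LOCATOR_FMT, LOCATOR_SIG, 0, len(content), 1)
archive = content + eocd64 + locator

prefix = b"#!/usr/bin/env python3\n"   # e.g. a launcher stub placed before the zip
print(stored_and_actual(archive))           # (64, 64)
print(stored_and_actual(prefix + archive))  # (64, 87)
```

The gap between the two numbers is exactly the length of the prefix, which is why seeking to reloff as an absolute position misses the record in such files.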

    # Assume no 'zip64 extensible data'
    fpin.seek(offset - sizeEndCentDir64Locator - sizeEndCentDir64, 2)
    data = fpin.read(sizeEndCentDir64)
    if len(data) != sizeEndCentDir64:
        return endrec
    sig, sz, create_version, read_version, disk_num, disk_dir, \
        dircount, dircount2, dirsize, diroffset = \
        struct.unpack(structEndArchive64, data)
    if sig != stringEndArchive64:
        loc_pos = fpin.tell()
        # Seek to the earliest possible eocd64 start
        fpin.seek(reloff, 0)
        # Read all the data between here and the eocd64 locator
        data = fpin.read(loc_pos - reloff)
        start = data.rfind(stringEndArchive64)
        if start >= 0 and len(data) - start > sizeEndCentDir64:
            sig, sz, create_version, read_version, disk_num, disk_dir, \
                dircount, dircount2, dirsize, diroffset = \
                struct.unpack(structEndArchive64, data[start:start+sizeEndCentDir64])
            if sig != stringEndArchive64:
                return endrec
        else:
            return endrec

My code needs improvement, as data = fpin.read(loc_pos - reloff) might read a substantial amount of data if there's a big blob of data before the zip. It would also be a good idea to check that the sizes and offsets are consistent, namely that:

the zip64 end locator is actually sz bytes from the position just after the sz field;

the position of the stringEndArchive64 signature found using rfind is the same as the position of the stringEndArchive64 signature found directly after the central directory.
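The two consistency checks could look roughly like the sketch below. It is not the code from the PR: check_eocd64_candidate is a hypothetical helper, window is assumed to end exactly where the zip64 locator begins, and the second check assumes the EOCD64 record immediately follows the central directory, so that the locator's stored offset equals diroffset + dirsize regardless of any prepended data.

```python
import struct

EOCD64_FMT = "<4sQ2H2L4Q"                   # zip64 end of central directory record
EOCD64_SIG = b"PK\x06\x06"
EOCD64_SIZE = struct.calcsize(EOCD64_FMT)   # 56 bytes


def check_eocd64_candidate(window: bytes, start: int, reloff: int) -> bool:
    """Return True if the record at window[start:] passes both checks.

    window is assumed to end exactly where the zip64 locator begins;
    reloff is the offset stored in that locator.
    """
    if start < 0 or len(window) - start < EOCD64_SIZE:
        return False
    (sig, sz, _create, _read, _disk, _disk_dir,
     _count, _count2, dirsize, diroffset) = struct.unpack_from(
        EOCD64_FMT, window, start)
    if sig != EOCD64_SIG:
        return False
    # Check 1: sz counts the bytes after the 12-byte signature + sz prefix,
    # so the record must end exactly where the locator starts.
    if start + 12 + sz != len(window):
        return False
    # Check 2: the record should sit right after the central directory, so
    # the locator's stored offset should equal diroffset + dirsize.
    return reloff == diroffset + dirsize


# Extensible data that happens to contain the EOCD64 signature.
ext = EOCD64_SIG + b"\x00" * 60
record = struct.pack(EOCD64_FMT, EOCD64_SIG, 44 + len(ext), 45, 45, 0, 0, 1, 1, 24, 40)
window = b"\x00" * 64 + record + ext

fake = window.rfind(EOCD64_SIG)                   # lands inside the extensible data
print(check_eocd64_candidate(window, fake, 64))   # False
print(check_eocd64_candidate(window, 64, 64))     # True
```

A rejected candidate would then prompt a search further back in the window, for example with window.rfind(EOCD64_SIG, 0, fake).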


bedevere-app · 2024-11-16T09:57:34Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

cmaloney · 2024-11-18T20:00:21Z

This will need a news entry describing the feature it's adding. The CI / PR test failures look related to the code changes and need to be resolved.

Properly read zip64 archives with non-empty zip64 extensible data sec…

b0f37b3

…tor in Zip64 end of central directory record Fixes python#126834

bedevere-app bot added the awaiting review label Nov 14, 2024

bedevere-app bot mentioned this pull request Nov 14, 2024

Cannot open zip64 file if Zip64EOCD record has additional data #126834

Open

cmaloney reviewed Nov 15, 2024


fixup! Properly read zip64 archives with non-empty zip64 extensible d…

4f57dc5

…ata sector in Zip64 end of central directory record


cmaloney Nov 14, 2024

VladRassokhin Nov 16, 2024 (edited)

VladRassokhin Nov 17, 2024

cmaloney Nov 17, 2024

danifus Nov 19, 2024
