Copilot AI commented Jan 3, 2026


Description

ZipArchive produces corrupted ZIP files when a file >4GB is written at an offset >4GB. In WriteCentralDirectoryFileHeaderInitialize, the Zip64ExtraField for sizes was being overwritten when setting the offset:

// Before: overwrites sizes with new object
zip64ExtraField = new() { LocalHeaderOffset = _offsetOfLocalHeader };

// After: preserves sizes, adds offset
zip64ExtraField ??= new();
zip64ExtraField.LocalHeaderOffset = _offsetOfLocalHeader;

This caused 7-Zip to show Extra_ERROR Zip64_ERROR and ZipFile.OpenRead to throw InvalidDataException: A local file header is corrupt.
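To make the failure concrete, here is a minimal, self-contained sketch of the header-building decision; Zip64ExtraField below is a simplified stand-in for the internal type, and BuildExtraField is a hypothetical helper, not the actual ZipArchiveEntry code:

// Simplified stand-in for the internal Zip64ExtraField type (illustrative only).
class Zip64ExtraField
{
    public long? UncompressedSize;
    public long? CompressedSize;
    public long? LocalHeaderOffset;
}

class Zip64Demo
{
    const long Mask32Bit = uint.MaxValue;

    // Mirrors the shape of WriteCentralDirectoryFileHeaderInitialize's logic:
    // sizes and offset each independently require ZIP64 once they exceed 4GB.
    static Zip64ExtraField? BuildExtraField(long uncompressed, long compressed, long offset)
    {
        Zip64ExtraField? zip64 = null;

        if (uncompressed > Mask32Bit || compressed > Mask32Bit)
            zip64 = new() { UncompressedSize = uncompressed, CompressedSize = compressed };

        if (offset > Mask32Bit)
        {
            // Buggy version: zip64 = new() { LocalHeaderOffset = offset };
            // When both branches ran, this replaced the object created above and
            // silently dropped the 64-bit sizes from the central directory record.
            zip64 ??= new();
            zip64.LocalHeaderOffset = offset;
        }

        return zip64;
    }
}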

Customer Impact

ZIP archives containing large files (>4GB) positioned after >4GB of preceding data become unreadable. Affects backup/archive scenarios with large datasets.

Regression

No. This is a longstanding bug in the ZIP64 handling logic.

Testing

  • Added regression test LargeFile_At_LargeOffset_ZIP64_HeaderPreservation covering the specific scenario (a sketch of the approach follows this list)
  • Test includes OOM exception handling with SkipTestException to gracefully skip when memory is insufficient
  • All 1359 existing System.IO.Compression tests pass
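
For reference, the shape of such a test might look roughly like the sketch below; WriteEntry is a hypothetical helper, and the attribute and padding structure are simplified relative to the real zip_LargeFiles.cs:

using System;
using System.IO;
using System.IO.Compression;
using Microsoft.DotNet.XUnitExtensions; // SkipTestException
using Xunit;

public static class Zip64RegressionSketch
{
    [Fact]
    public static void LargeFile_At_LargeOffset_ZIP64_HeaderPreservation()
    {
        byte[] chunk;
        try
        {
            chunk = new byte[64 * 1024 * 1024]; // reused zero-filled buffer
        }
        catch (OutOfMemoryException e)
        {
            throw new SkipTestException(e.Message); // skip gracefully when memory is tight
        }

        string zipPath = Path.Combine(Path.GetTempPath(), "zip64-offset-repro.zip");
        const long FiveGB = 5L * 1024 * 1024 * 1024;

        using (FileStream fs = File.Create(zipPath))
        using (var archive = new ZipArchive(fs, ZipArchiveMode.Create))
        {
            // >4GB of preceding data pushes the second entry's local header past
            // the 4GB offset limit (the real test uses many small files)...
            WriteEntry(archive, "padding.bin", chunk, FiveGB);
            // ...and a >4GB entry makes the sizes need ZIP64 too, so both
            // conditions hold for the same central directory record.
            WriteEntry(archive, "large.bin", chunk, FiveGB);
        }

        // Before the fix, opening the entries threw InvalidDataException:
        // "A local file header is corrupt."
        using (ZipArchive reopened = ZipFile.OpenRead(zipPath))
        {
            foreach (ZipArchiveEntry entry in reopened.Entries)
                entry.Open().Dispose();
        }

        File.Delete(zipPath);
    }

    static void WriteEntry(ZipArchive archive, string name, byte[] chunk, long totalBytes)
    {
        using Stream s = archive.CreateEntry(name, CompressionLevel.NoCompression).Open();
        for (long written = 0; written < totalBytes; written += chunk.Length)
            s.Write(chunk, 0, (int)Math.Min(chunk.Length, totalBytes - written));
    }
}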

Risk

Low. Single-line logic change in a specific code path that only affects ZIP64 central directory header writing when both sizes and offset exceed 4GB.

Package authoring no longer needed in .NET 9

IMPORTANT: Starting with .NET 9, you no longer need to edit a NuGet package's csproj to enable building and bump the version.
Keep in mind that we still need package authoring in .NET 8 and older versions.
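For those older branches, the csproj edits are usually along these lines; treat the exact property names as an assumption to verify against the servicing docs for the branch:

<!-- Illustrative servicing-branch packaging snippet (verify property names per branch). -->
<PropertyGroup>
  <IsPackable>true</IsPackable>            <!-- enable building the package -->
  <ServicingVersion>1</ServicingVersion>   <!-- bump to produce a new patch version -->
</PropertyGroup>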

Original prompt

This section details the original issue you should resolve

<issue_title>ZipArchive creates corrupted ZIP when writing large dataset with many repeated files</issue_title>
<issue_description>### Description

RavenDB snapshot backups produced with ZipArchive can be unrecoverable due to ZIP header corruption. Producing a snapshot backup (a ZIP archive written with System.IO.Compression.ZipArchive) over a specific data set results in a ZIP that fails to open correctly:

  • 7‑Zip shows Extra_ERROR Zip64_ERROR: UTF8 (for entry Documents\Raven.voron), and the Packed Size looks capped at 4GB.
  • System.IO.Compression.ZipFile.OpenRead(...).Entries[i].Open() throws System.IO.InvalidDataException: A local file header is corrupt.

Writing the exact same dataset and order using SharpZipLib’s ZipOutputStream produces a valid ZIP that both 7‑Zip and ZipFile.OpenRead can read.

This started affecting us after introducing a feature that creates many per-index journal files that are hard links to the same underlying file content (so multiple distinct file paths share the exact same bytes on disk). Our dataset also includes a large 30GB file (Raven.voron). The combination seems to trigger a bug.

Reproduction Steps

Repro dataset

> $RootPath = (Get-Item .).FullName; Get-ChildItem -Path . -Include *.journal -Recurse -File | Get-FileHash | Select-Object @{Name='Path'; Expression={ $_.Path.Replace($RootPath + "\", "") }}, Hash, Algorithm

Path                                                         Hash                                                             Algorithm
----                                                         ----                                                             ---------
Configuration\Journals\0000000000000000001.journal           96F77B06EBF13895A297B7182BC162B42A05CC9B444D488A87FA541CD9962516 SHA256
Indexes\@SharedJournals\Journals\0000000000000000107.journal 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Activity_ByMonth\Journals\0000000000000000008.jou... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Search\Journals\0000000000000000004.jou... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Tags\Journals\0000000000000000007.journal  16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Questions_Tags_ByMonths\Journals\0000000000000000... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Users_Registrations_ByMonth\Journals\000000000000... 16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256
Indexes\Users_Search\Journals\0000000000000000005.journal    16BB9C3A617844EFA25254184A4AF7E0E36ED0B656C12A71952D28F3EE2C3156 SHA256

Repro app

Single‑file console app (targets net8.0 or net10.0). It copies files from the dataset into a ZIP using ZipArchive, in the exact order RavenDB snapshot backup uses:

  • Indexes (excluding any @* folder such as @SharedJournals), then
  • Documents (root storage env), then
  • Configuration folder
// Add package: ICSharpCode.SharpZipLib
//
// Example csproj snippet:
// <ItemGroup>
//   <PackageReference Include="SharpZipLib" Version="1.4.2" />
// </ItemGroup>
//
// Usage:
//   ZipArchiveIssue <sourceDbFolder> <outputDir> [options]
//
// Options:
//   --ziparchive             Generate ZIP using System.IO.Compression.ZipArchive
//   --sharpzip               Generate ZIP using SharpZipLib ZipOutputStream
//   --level=<Optimal|Fastest|NoCompression>   Compression level (default: Optimal)
//   --nonseekable            Wrap output stream to simulate non-seekable sink (ZipArchive data-descriptor path)
//   --outname=<baseName>     Base file name (default: derived from folder name)
//   --verify                 After writing, attempt to open/read entries via ZipFile.OpenRead
//
// Mapping mirrors RavenDB snapshot shape, copying from disk:
// - Order: Indexes -> Documents -> Configuration (matches RavenDB snapshot backup)
// - Root DB env  -> Documents/
// - Configuration/ -> Configuration/
// - Indexes/<IndexName>/ -> Indexes/<IndexName>/
// - Include files: Raven.voron, headers.one, headers.two, database.metadata, Journals/*.journal
// - Skip: any Temp/ folders, and all Indexes/@* folders (e.g. @SharedJournals)

#nullable enable
using System;
using System...
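
The program body is truncated above; a minimal sketch of its ZipArchive path, with simplified file selection (the helper names here are illustrative, not the original app's), would be along these lines:

using System;
using System.IO;
using System.IO.Compression;

string source = args[0];
string outputZip = Path.Combine(args[1], Path.GetFileName(source) + ".zip");

using FileStream output = File.Create(outputZip);
using var zip = new ZipArchive(output, ZipArchiveMode.Create);

// Order matches the RavenDB snapshot backup: Indexes -> Documents -> Configuration.
// (Temp/ filtering omitted for brevity.)
foreach (string indexDir in Directory.EnumerateDirectories(Path.Combine(source, "Indexes")))
{
    string name = Path.GetFileName(indexDir);
    if (name.StartsWith('@')) continue; // skip @SharedJournals and other @* folders
    AddFolder(indexDir, "Indexes/" + name);
}

// Root storage env files go under Documents/ (the real app includes only
// Raven.voron, headers.one/two, database.metadata and Journals/*.journal).
foreach (string file in Directory.EnumerateFiles(source))
    zip.CreateEntryFromFile(file, "Documents/" + Path.GetFileName(file));
AddFolder(Path.Combine(source, "Journals"), "Documents/Journals");

AddFolder(Path.Combine(source, "Configuration"), "Configuration");

void AddFolder(string dir, string zipPrefix)
{
    foreach (string file in Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories))
    {
        string rel = Path.GetRelativePath(dir, file).Replace('\\', '/');
        zip.CreateEntryFromFile(file, zipPrefix + "/" + rel, CompressionLevel.Optimal);
    }
}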





- Fixes dotnet/runtime#122489


When both AreSizesTooLarge and IsOffsetTooLarge are true,
the Zip64ExtraField was being overwritten in the central
directory header logic, losing the size information.

This fixes the issue by reusing the existing Zip64ExtraField
when adding the offset, using ??= instead of creating a new object.

Fixes #114205

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix ZipArchive ZIP header corruption for large datasets" to "Fix ZIP64 header corruption when large file at large offset" on Jan 3, 2026
Copilot AI requested a review from stephentoub January 3, 2026 19:35
@stephentoub stephentoub marked this pull request as ready for review January 4, 2026 00:26
Copilot AI review requested due to automatic review settings January 4, 2026 00:26
@dotnet-policy-service

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.


Copilot AI left a comment


Pull request overview

This pull request fixes a critical bug in ZipArchive where files larger than 4GB positioned at offsets greater than 4GB would produce corrupted ZIP files. The corruption occurred because the ZIP64 extra field handling was overwriting previously set size information when adding offset information.

Key Changes

  • Fixed the ZIP64 header preservation logic in WriteCentralDirectoryFileHeaderInitialize by using null-coalescing assignment (??=) instead of creating a new object
  • Added regression test to verify large files at large offsets are handled correctly

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs: Changed line 530 to use the ??= operator to preserve the existing ZIP64 extra field instead of overwriting it with a new instance
  • src/libraries/System.IO.Compression/tests/ZipArchive/zip_LargeFiles.cs: Added new test LargeFile_At_LargeOffset_ZIP64_HeaderPreservation that creates a ZIP with 5GB of small files followed by a 5GB large file to trigger both the size and offset ZIP64 conditions

Wrap buffer allocation in try-catch for OutOfMemoryException
and throw SkipTestException to gracefully skip the test
when insufficient memory is available.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
