
Reduce memory usage when updating ZipArchives #102704

Merged

13 commits merged into dotnet:main from edwardneal:issue-1543-triage on Jan 24, 2025

Conversation

edwardneal
Contributor

@edwardneal edwardneal commented May 26, 2024

There are a handful of different issues surrounding ZipArchive, so this PR touches on a few. It makes ZipArchive more conservative when writing in WriteFile, by tracking the type of change made to each archive entry and to the archive itself.

Previously, every ZipArchiveEntry's data would be loaded into memory, the output stream would have its length set to zero, then the archive entries would be written out in sequence. This naturally causes problems when working with very large ZIP files.

I've changed this behaviour by making the ZipArchive track the offset of its first deleted entry, _firstDeletedEntryOffset. When writing a file:

  • Find the offset of the first entry with any changes at all (startingOffset), and the first entry with changes to its dynamic-length metadata or entry data (completeRewriteStartingOffset).
  • If any entries have an offset greater than startingOffset, add them to the list (entriesToWrite) of entries to persist.
  • If any entries have an offset greater than completeRewriteStartingOffset, load their contents into memory.
  • Position the stream offset at startingOffset, then start writing each entry in entriesToWrite. When writing:
    • If the entry has changed at all (or lies at or beyond completeRewriteStartingOffset), write the header. If not, seek past it.
    • If the entry has changes to its data, write the data (loaded in an earlier step). If not, seek past it.
  • If necessary, write (or seek past) each entry's central directory header. Track whether or not any of these were written.
  • If any central directory headers were written (or if there are no entries in the archive), write the end-of-central-directory block. If not, seek past it.
  • Shrink the stream to its current position.

This relies upon the list of the ZipArchive's existing ZipArchiveEntry records being sorted in offset order, which is now guaranteed on load.
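The steps above can be sketched as follows. This is an illustrative outline only, not the merged implementation: the names startingOffset, completeRewriteStartingOffset, entriesToWrite, Changes, and _firstDeletedEntryOffset come from the description, while the ChangeState flag names and the helper methods (LoadIntoMemory, WriteLocalHeaderAndData, WriteCentralDirectoryAndEocd) are hypothetical.

```csharp
// Illustrative sketch of the WriteFile flow described above; helper names are assumptions.
long startingOffset = _firstDeletedEntryOffset;     // a deleted entry forces a rewrite from its offset
long completeRewriteStartingOffset = long.MaxValue;

foreach (ZipArchiveEntry entry in entries)
{
    if (entry.Changes != ChangeState.Unchanged)
        startingOffset = Math.Min(startingOffset, entry.OffsetOfLocalHeader);

    // Hypothetical flag names for "dynamic-length metadata or entry data changed".
    if ((entry.Changes & (ChangeState.DynamicLengthMetadata | ChangeState.StoredData)) != 0)
        completeRewriteStartingOffset = Math.Min(completeRewriteStartingOffset, entry.OffsetOfLocalHeader);
}

// Entries at or past the first change must be persisted (they rely on the list being
// sorted by offset); entries at or past the complete-rewrite point must be buffered,
// because the bytes they currently occupy are about to be overwritten.
List<ZipArchiveEntry> entriesToWrite = entries
    .Where(e => e.OffsetOfLocalHeader >= startingOffset)
    .ToList();

foreach (ZipArchiveEntry entry in entriesToWrite)
    if (entry.OffsetOfLocalHeader >= completeRewriteStartingOffset)
        entry.LoadIntoMemory();

stream.Seek(startingOffset, SeekOrigin.Begin);
foreach (ZipArchiveEntry entry in entriesToWrite)
    entry.WriteLocalHeaderAndData(forceWrite: entry.OffsetOfLocalHeader >= completeRewriteStartingOffset);

WriteCentralDirectoryAndEocd();     // write or seek past each header, then the EOCD block
stream.SetLength(stream.Position);  // shrink the stream to its current position
```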

Issue links:

There's also an issue (#1544) relating to ZipArchiveEntry storing uncompressed data in memory unnecessarily. This PR doesn't fix that issue, but it means that somebody won't fix it only to discover that ZipArchive.Dispose loads the contents into memory anyway.

I've added a number of test cases which cover the corner cases I've thought of, verifying the correct number of writes and the contents of the files as they go. It's quite a core part of ZipArchive though and I'm conscious that a number of libraries depend upon it, so I'm open to adding more.

Edit: From benchmarking the "append case", my results were pretty much as I expected: performance improves, but becomes more variable. Updating/deleting entries at the end of an archive (or after the largest entries in the archive) becomes faster and uses less memory, while updating/deleting entries at the start of an archive takes about as long as it did before.


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3593/23H2/2023Update/SunValley3)
Intel Core i7-8565U CPU 1.80GHz (Whiskey Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]     : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX2
  Job-QKCECQ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Toolchain=CoreRun 

| Benchmark | Mean     | Error     | StdDev    | Gen0      | Gen1      | Gen2      | Allocated |
|-----------|----------|-----------|-----------|-----------|-----------|-----------|-----------|
| Pre-PR    | 5.128 s  | 0.1346 s  | 0.3925 s  | 1000.0000 | 1000.0000 | 1000.0000 | 2 GB      |
| Post-PR   | 8.169 ms | 0.2386 ms | 0.7036 ms | -         | -         | -         | 7.24 KB   |
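The benchmark source isn't included above; a minimal BenchmarkDotNet sketch of the append case might look like the following (entry names, sizes, and structure are assumptions, not the benchmark actually run):

```csharp
using System.IO;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ZipAppendBenchmark
{
    private byte[] _archiveBytes = null!;

    [GlobalSetup]
    public void Setup()
    {
        // Build an archive containing one large entry. Post-PR, appending to the
        // archive should no longer require this entry to be buffered in memory.
        using var ms = new MemoryStream();
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            using Stream s = archive.CreateEntry("large.bin").Open();
            s.Write(new byte[256 * 1024 * 1024]);
        }
        _archiveBytes = ms.ToArray();
    }

    [Benchmark]
    public void AppendEntry()
    {
        // Copy the prepared archive so each iteration starts from the same bytes.
        using var ms = new MemoryStream();
        ms.Write(_archiveBytes);
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Update, leaveOpen: true))
        {
            using Stream s = archive.CreateEntry("appended.txt").Open();
            s.Write("hello"u8);
        }
    }

    public static void Main() => BenchmarkRunner.Run<ZipAppendBenchmark>();
}
```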

WriteFile now takes a more granular approach to writing the ZipArchive to its stream.
Adding a new file to the end of the archive will no longer require every file in the archive to be loaded into memory.
Changing fixed-length metadata inside a file will no longer require the entire archive to be re-written.
Member

@ericstj ericstj left a comment


I like the premise of making ZipArchive writes more incremental - would like someone from @dotnet/area-system-io-compression to have a look and see if they can give more guidance for direction in what they'd like to see in this PR. Added a couple comments around what stuck out.

Renamed Dirty and DirtyState to Changed and ChangeState.
Explicitly assigned Unchanged as a zero-value ChangeState.
Reset _originallyInArchive and _offsetOfLocalHeader to private members, exposed as internal properties.
Also changed _versionToExtract to be private - this isn't ever used in System.IO.Compression outside of ZipArchiveEntry.
@ericstj
Member

ericstj commented Jul 22, 2024

@carlossanlop can you please review?

@carlossanlop carlossanlop added this to the 10.0.0 milestone Sep 30, 2024
@udlose

udlose commented Oct 8, 2024

@carlossanlop @ericstj

  1. Is there any update on this?
  2. Possibly coming in .NET 9?

@ericstj
Member

ericstj commented Oct 9, 2024

It's too late for .NET 9 but we'll have a look at getting into main for .NET 10.0. There's potential fallout from this change and we need to make sure it gets a closer look and plenty of time to bake before going out stable.

@edwardneal
Contributor Author

Thanks for the update @ericstj.

To expand on the note in the original description:

I've made an improvement, but I think a complete fix will be more complex (and might look quite like a compacting heap.)

This PR makes an improvement to the average case and resolves the case of appending new entries to the ZIP archive, but it's limited: if the user changes the contents of the first entry in the archive, WriteFile will write the entire archive again - even if the entry content changes have resulted in the entry becoming smaller. The same thing will happen if the user renames the file, even if the encoded name is shorter.

A mechanism similar to a compacting heap would fix this, but I felt that needed more design work. The primary tradeoff which I was thinking about was between unused space in the archives, and the time/IO required to start moving entries around (if that's even possible - ZipArchive deals with unseekable streams.) That tradeoff naturally changes as we move towards the end of an archive, since the IO to append a large entry to the end of an archive is less expensive than the IO which reshuffles enough entries to make free space at the start.

If there's already a data structure which handles this case in .NET and which accounts for the cost of IO, then this PR can use that directly and reduce the risk.

@steveharter
Member

From the description's first step:

Find the offset of the first entry with any changes at all (startingOffset), and the first entry with changes to its dynamic-length metadata or entry data (dynamicDirtyStartingOffset).

I assume then that all of the entries from that first changed entry to the last entry will be re-written and thus have no fragmentation and thus be the same physically on disk as before the PR? If so, then we shouldn't need to add a new enum value to ZipArchiveMode but would just need to update the documentation for ZipArchiveMode.Update where it says "The content of the entire archive is held in memory."

* Used named parameter when passing a constant to forceWrite.
* Replaced two magic numbers with constants.
* Small cleanup of ZipArchive.WriteFile for clarity.
* Renamed ZipArchiveEntry.Changed to Changes.
@edwardneal
Contributor Author

Thanks @steveharter - all of the changes in your review look good, I've rolled them forward. I've also brought the PR up to date, since it pre-dates quite a bit of the zlib changes and the .NET 9 branch.

I assume then that all of the entries from that first changed entry to the last entry will be re-written and thus have no fragmentation and thus be the same physically on disk as before the PR?

That's correct, for changes to an entry's contents or to its dynamic-length metadata (name/content/etc.) An existing entry's ExternalAttributes or LastWriteTime could be changed, and since these are fixed-length fields ZipArchiveEntry will simply rewrite the header as-is; if the only change to an entry is a fixed-length field, subsequent entries won't be rewritten.
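For example (illustrative usage against a hypothetical archive.zip; not code from the PR):

```csharp
using System;
using System.IO.Compression;

using ZipArchive archive = ZipFile.Open("archive.zip", ZipArchiveMode.Update);
ZipArchiveEntry first = archive.Entries[0];

// Fixed-length header fields: rewriting them in place doesn't move any bytes,
// so entries after this one don't need to be rewritten on Dispose.
first.LastWriteTime = DateTimeOffset.UtcNow;
first.ExternalAttributes |= 0x01;

// By contrast, changing the entry's data (or dynamic-length metadata such as
// its name) shifts every following entry, forcing them to be rewritten.
```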

The behaviour of ZipArchiveMode.Update is covered by another issue (#1544) with an API proposal to help mitigate this (#101243). However, this PR's loosely related to it. Opening a ZipArchive in Update mode currently triggers two pieces of behaviour:

  1. When opening a ZipArchiveEntry's stream, the entire entry is loaded into memory.
  2. When disposing the ZipArchive, every entry is loaded into memory and written back out to the stream.

This PR only deals with the second behaviour. The post-PR documentation should be updated - we no longer guarantee that the entire archive will be held in memory. Instead, we guarantee that every entry which we open (and every entry which follows this in the file) will be loaded into memory. The wording of this is a little tricky though. Perhaps something similar to this?

When you set the mode to Update, the underlying file or stream must support reading, writing, and seeking. The content of every modified entry (and every entry which follows) in the archive is held in memory, and no data is written to the underlying file or stream until the archive is disposed.

Member

@steveharter steveharter left a comment


PTAL area owners @dotnet/area-system-io-compression.

This LGTM. I don't see any potential breaking changes. However, a doc issue should be created as discussed to update the wording around the whole zip will be held in memory.

Thanks @edwardneal

Member

@carlossanlop carlossanlop left a comment


Thank you so much for this change, @edwardneal . It's going to improve this code a lot. I'll be curious to see if our perf runs capture any significant perf gains out of this new code that should skip entries that don't need to be rewritten. I'd like to assume our benchmarks touch this code at some point. We'll see. 🤞🏼

I left some comments for you to consider, as well as some questions. Can you please take a look?

The list of entries in a ZipArchive is now only sorted when opened in Update mode.
Added/modified a test to verify that these entries appear in the correct order: offset order when opened in Update mode, central directory entry order when opened in Read mode.
@carlossanlop
Member

Thanks for your latest changes. I don't have any more new feedback but I'm still consulting with some folks on ideas about that sort. Meanwhile I'll run some extra CI runs to see if it all looks good on mobile platforms. I'll come back to you when I get some more answers.

I mentioned this in your other PR, but mentioning it here too: we can try to aim for Preview1 to merge this change. Code Complete day is Monday January 27th. One thing to keep in mind is that I expect a bunch of merge conflicts with your other PR, so let's get ready for that too.

@carlossanlop
Member

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@edwardneal
Contributor Author

Thanks carlossanlop. I've checked the logs for the extra runs - there are failures in the same area on the linux-x64 Release Libraries_Release_CoreCLR leg, but it looks like this is related to a missing glibc version:

Failed to load /root/helix/work/correlation/shared/Microsoft.NETCore.App/10.0.0/libcoreclr.so, error: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /root/helix/work/correlation/shared/Microsoft.NETCore.App/10.0.0/libcoreclr.so)
Failed to bind to CoreCLR at '/root/helix/work/correlation/shared/Microsoft.NETCore.App/10.0.0/'
Failed to create CoreCLR, HRESULT: 0x80008088

@carlossanlop
Member

carlossanlop commented Jan 23, 2025

@edwardneal I talked with our security expert @GrabYourPitchforks about any concerns regarding the sort. Here's the summary of the conversation:

  • The sorting does not seem concerning, because a List<ZipArchiveEntry> will sort the object references under the covers, not the full objects in memory.
  • The sort has a complexity of O(n log n), as documented in the Remarks section of Sort(Comparison<T>).
  • The sort is unstable, meaning that if two elements are equal, their order might not get preserved (also documented in Remarks). But since we're sorting unique numbers (the offset of each entry in the header), we don't have to worry about this limitation.
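The sort under discussion boils down to a single comparison on the local header offset; a sketch (the internal field and property names are assumptions):

```csharp
// Reorders the object references in _readEntries in place. List<T>.Sort is
// O(n log n) and unstable, which is harmless here: local header offsets are
// unique, so no two entries ever compare equal.
_readEntries.Sort((x, y) => x.OffsetOfLocalHeader.CompareTo(y.OffsetOfLocalHeader));
```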

We did discuss a behavior that needs verification: if EnsureCentralDirectoryRead is only called when the archive is opened, then _readEntries will only contain the entries that were originally in the archive. What would happen if we open the archive in Update mode, modify the variable-size fields of a few existing entries, but also add some new entries?

I see that in the proposed code, the sort happens in EnsureCentralDirectoryRead (existing entries), which means that any added entries will get appended at the end of _readEntries and they should get written at the end of the archive.

Unfortunately, there's no test verifying this behavior: opening in Update mode, modifying entries, and adding entries. Can you please add a test for this?
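A sketch of such a test, using xUnit (hypothetical helper and entry names; not the test as merged):

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using Xunit;

// CreateArchiveBytes is a hypothetical helper producing an archive whose
// entries are named entry0, entry1, entry2.
using var stream = new MemoryStream(CreateArchiveBytes(entryCount: 3));

using (var archive = new ZipArchive(stream, ZipArchiveMode.Update, leaveOpen: true))
{
    // Modify a variable-size field of an existing entry by rewriting its data.
    using (Stream s = archive.Entries[1].Open())
    {
        s.SetLength(0);
        s.Write("modified"u8);
    }

    // Then add new entries.
    archive.CreateEntry("new1.txt");
    archive.CreateEntry("new2.txt");
}

stream.Position = 0;
using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
{
    // Expected order: original entries first, new entries appended at the end.
    Assert.Equal(new[] { "entry0", "entry1", "entry2", "new1.txt", "new2.txt" },
                 archive.Entries.Select(e => e.FullName));
}
```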

Aside from that final request, and running runtime-extra-platforms again, I think we would be good to merge.

@carlossanlop
Member

Can you please merge conflicts since I just merged your other PR? 😄

@edwardneal
Contributor Author

Thanks both for the review - that sounds good to me! I'll address the merge conflicts today and add the extra test.

This accounts for the removal of BinaryReader in an earlier PR
This test modifies an entry at a specific index, adds N entries after it, then verifies that they've been written in the expected order: existing entries first, new entries afterwards
@carlossanlop
Member

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

Member

@carlossanlop carlossanlop left a comment


LGTM. Assuming no related CI issues, we can merge. Thanks @edwardneal!

@edwardneal
Contributor Author

That's great, thanks for your reviews @carlossanlop. I've checked the test results, and the failures look unrelated to me.

@carlossanlop carlossanlop merged commit 0a477a8 into dotnet:main Jan 24, 2025
119 of 126 checks passed
@edwardneal edwardneal deleted the issue-1543-triage branch January 24, 2025 13:01
@carlossanlop carlossanlop mentioned this pull request Feb 13, 2025
15 tasks
@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2025
Labels
area-System.IO.Compression community-contribution Indicates that the PR has been added by a community member