-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory usage when updating ZipArchives #102704
Conversation
WriteFile now takes a more granular approach to writing the ZipArchive to its stream. Adding a new file to the end of the archive will no longer require every file in the archive to be loaded into memory. Changing fixed-length metadata inside a file will no longer require the entire archive to be re-written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the premise of making ZipArchive writes more incremental - would like someone from @dotnet/area-system-io-compression to have a look and see if they can give more guidance for direction in what they'd like to see in this PR. Added a couple comments around what stuck out.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
Renamed Dirty and DirtyState to Changed and ChangeState. Explicitly assigned Unchanged as a zero-value ChangeState.
Reset _originallyInArchive and _offsetOfLocalHeader to private members, exposed as internal properties. Also changed _versionToExtract to be private - this isn't ever used in System.IO.Compression outside of ZipArchiveEntry.
@carlossanlop can you please review? |
|
It's too late for .NET 9 but we'll have a look at getting into |
Thanks for the update @ericstj. To expand on the note in the original description:
This PR makes an improvement improvement to the average case and resolves the case of appending new entries to the ZIP archive, but it's limited: if the user changes the contents of the first entry in the archive, WriteFile will write the entire archive again - even if the entry content changes have resulted in the entry becoming smaller. The same thing will happen if the user renames the file, even if the encoded name is shorter. A mechanism similar to a compacting heap would fix this, but I felt that needed more design work. The primary tradeoff which I was thinking about was between unused space in the archives, and the time/IO required to start moving entries around (if that's even possible - ZipArchive deals with unseekable streams.) That tradeoff naturally changes as we move towards the end of an archive, since the IO to append a large entry to the end of an archive is less expensive than the IO which reshuffles enough entries to make free space at the start. If there's already a data structure which handles this case in .NET and which accounts for the cost of IO, then this PR can use that directly and reduce the risk. |
From the description's first step:
I assume then that all of the entries from that first changed entry to the last entry will be re-written and thus have no fragmentation and thus be the same physically on disk as before the PR? If so, then we shouldn't need to add a new enum value to ZipArchiveMode but would just need to update the documentation for |
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
* Used named parameter when passing a constant to forceWrite. * Replaced two magic numbers with constants. * Small cleanup of ZipArchive.WriteFile for clarity. * Renamed ZipArchiveEntry.Changed to Changes.
Thanks @steveharter - all of the changes in your review look good, I've rolled them forward. I've also brought the PR up to date, since it pre-dates quite a bit of the zlib changes and the .NET 9 branch.
That's correct, for changes to an entry's contents or to its dynamic-length metadata (name/content/etc.) An existing entry's ExternalAttributes or LastWriteTime could be changed, and since these are fixed-length fields ZipArchiveEntry will simply rewrite the header as-is; if the only change to an entry is a fixed-length field, subsequent entries won't be rewritten. The behaviour of
This PR only deals with the second behaviour. The post-PR documentation should be updated - we no longer guarantee that the entire archive will be held in memory. Instead, we guarantee that every entry which we open (and every entry which follows this in the file) will be loaded into memory. The wording of this is a little tricky though. Perhaps something similar to this?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL area owners @dotnet/area-system-io-compression.
This LGTM. I don't see any potential breaking changes. However, a doc issue should be created as discussed to update the wording around the whole zip will be held in memory.
Thanks @edwardneal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this change, @edwardneal . It's going to improve this code a lot. I'll be curious to see if our perf runs capture any significant perf gains out of this new code that should skip entries that don't need to be rewritten. I'd like to assume our benchmarks touch this code at some point. We'll see. 🤞🏼
I left some comments for you to consider, as well as some questions. Can you please take a look?
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
The list of entries in a ZipArchive is now only sorted when opened in Update mode. Added/modified a test to verify that these entries appear in the correct order: offset order when opened in Update mode, central directory entry order when opened in Read mode.
Thanks for your latest changes. I don't have any more new feedback but I'm still consulting with some folks on ideas about that sort. Meanwhile I'll run some extra CI runs to see if it all looks good on mobile platforms. I'll come back to you when I get some more answers. I mentioned this in your other PR, but mentioning it here too: we can try to aim for Preview1 to merge this change. Code Complete day is Monday January 27th. One thing to keep in mind is that I expect a bunch of merge conflicts with your other PR, so let's get ready for that too. |
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
Thanks carlossanlop. I've checked the logs for the extra runs - there are failures in the same area on the
|
@edwardneal I talked with our security expert @GrabYourPitchforks about any concerns regarding the sort. Here's the summary of the conversation:
We did discuss a behavior that needs verification: if I see that in the proposed code, the sort happens in Unfortunately, there's no test verifying this behavior: opening in Update mode, modifying entries, and adding entries. Can you please add a test for this? Aside from that final request, and running runtime-extra-platforms again, I think we would be good to merge. |
Can you please merge conflicts since I just merged your other PR? 😄 |
Thanks both for the review - that sounds good to me! I'll address the merge conflicts today and add the extra test. |
This accounts for the removal of BinaryReader in an earlier PR
This test modifies an entry at a specific index, adds N entries after it, then verifies that they've been written in the expected order: existing entries first, new entries afterwards
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Assuming no {related} CI issues, we can merge. Thanks @edwardneal!
That's great, thanks for your reviews @carlossanlop. I've checked the test results and they look unrelated to me:
|
There are a handful of different issues surrounding ZipArchive, so this PR touches on a few. It makes ZipArchive more conservative when writing in
WriteFile
. It does this by tracking the type of changes to each archive entry and of the archive itself.Previously, every ZipArchiveEntry's data would be loaded into memory, the output stream would have its length set to zero, then the archive entries would be written out in sequence. This naturally causes problems when working with very large ZIP files.
I've changed this behaviour by forcing the ZipArchive to track the offset its first deleted entry,
_firstDeletedEntryOffset
. When writing a file:startingOffset
), and the first entry with changes to its dynamic-length metadata or entry data (completeRewriteStartingOffset
).startingOffset
, add them to the list (entriesToWrite
) of entries to persist.completeRewriteStartingOffset
, load their contents into memory.startingOffset
, then start writing each entry inentriesToWrite
. When writing:completeRewriteStartingOffset
) write the header. If not, seek past it.This relies upon the list of the ZipArchive's existing ZipArchiveEntry records being sorted in offset order, which is now guaranteed on load.
Issue links:
There's also an issue (1544) which relates to ZipArchiveEntry storing uncompressed data in memory unnecessarily. This PR doesn't fix that issue, but it means that somebody won't fix it and discover that ZipArchive.Dispose is loading the contents into memory anyway.
I've added a number of test cases which cover the corner cases I've thought of, verifying the correct number of writes and the contents of the files as they go. It's quite a core part of ZipArchive though and I'm conscious that a number of libraries depend upon it, so I'm open to adding more.
Edit: From benchmarking the "append case", my results were pretty much as I expected: performance improves, but becomes more variable. Updating/deleting entries at the end of an archive (or after the largest entries in the archive) becomes faster and uses less memory, while updating/deleting entries at the start of an archive takes about as long as it did before.