System.IO.Compression ZipArchive bad performance over network #31460

Open
KrabatTilt opened this issue Nov 10, 2019 · 3 comments
Labels
area-System.IO.Compression question Answer questions and provide assistance, not an issue with source code or documentation.

@KrabatTilt

I am working with a file format that contains a ZipArchive inside a Package. The following code is used to access the data:

var outer = Package.Open(filename, FileMode.Open, FileAccess.Read, FileShare.Read);
var data = outer.GetPart(new Uri("/Image.data", UriKind.Relative));
var inner = new ZipArchive(data.GetStream(FileMode.Open, FileAccess.Read), ZipArchiveMode.Read, true);

Running that code targeting .NET Framework 4.7.2 takes about 0.5 sec to open a 400 MB file on an SMB network share over a 16 Mbit network connection. Memory consumption is 42 MB.

Running the same code targeting .NET Core 3.0 on the same file took 290 sec and ended up with 800 MB of memory consumption.

This is a huge regression in access time and memory consumption. Any suggestions as to where this comes from and how to work around it?

@stephentoub
Member

cc: @ericstj

@ericstj
Member

ericstj commented Nov 11, 2019

First off: can you change that format? Packages are ZIPs and storing a ZIP inside a ZIP isn't the best for performance. You aren't gaining anything from compressing twice and the nested zip will require reading more than necessary to extract its contents (additional zip overhead and seeks required to read this).

On .NET Framework the Package APIs had a different ZIP implementation that would buffer more (including buffering to a temporary file on disk). That buffering would have kicked in here, since the ZipArchive API ends up doing random access on that PackagePart. In addition, you end up hitting a codepath in ZipArchive where it copies the entire backing stream: https://github.com/dotnet/corefx/blob/bc115700c3ece60acd6b8dbe4b0bdb8f6f80c756/src/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs#L147. This wouldn't be hit on .NET Framework since the Package APIs would have buffered the part to a file behind the scenes for you. I discussed this a bit here: https://github.com/dotnet/corefx/issues/11669#issuecomment-468027597

To mimic the .NET Framework behavior, try extracting the Part to a temporary file and then opening the ZipArchive over that stream. That should give you similar memory characteristics and hopefully similar performance.
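
A minimal sketch of that suggestion, not a tested fix for this issue: it copies the part to a temporary file and opens the ZipArchive over the resulting seekable FileStream. The class and method names (NestedZip, OpenInnerArchive) are hypothetical, and on .NET Core the System.IO.Packaging NuGet package is assumed to be referenced.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.IO.Packaging;

static class NestedZip
{
    // Hypothetical helper: buffers a package part in a temp file so that
    // ZipArchive gets a seekable stream instead of the part stream.
    public static ZipArchive OpenInnerArchive(Package outer, Uri partUri)
    {
        PackagePart part = outer.GetPart(partUri);
        string tempPath = Path.GetTempFileName();

        // Copy the whole part sequentially; this still reads every byte once.
        using (Stream source = part.GetStream(FileMode.Open, FileAccess.Read))
        using (FileStream target = File.Create(tempPath))
        {
            source.CopyTo(target);
        }

        // DeleteOnClose removes the temp file when the stream is disposed.
        var tempStream = new FileStream(
            tempPath, FileMode.Open, FileAccess.Read, FileShare.Read,
            4096, FileOptions.DeleteOnClose);
        try
        {
            // leaveOpen: false, so disposing the archive disposes the stream.
            return new ZipArchive(tempStream, ZipArchiveMode.Read, leaveOpen: false);
        }
        catch
        {
            tempStream.Dispose();
            throw;
        }
    }
}
```

With the code from the original post this would be called as NestedZip.OpenInnerArchive(outer, new Uri("/Image.data", UriKind.Relative)). Note that the copy transfers the entire part across the network before the first entry can be read, so it addresses memory use more directly than access time.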

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jun 18, 2020
@KrabatTilt
Author

KrabatTilt commented Aug 28, 2020

Sorry for the very late answer.

> can you change that format?

Not really, as it is a legacy format with a lot of files already in circulation.

> Packages are ZIPs and storing a ZIP inside a ZIP isn't the best for performance. You aren't gaining anything from compressing twice and the nested zip will require reading more than necessary to extract its contents (additional zip overhead and seeks required to read this).

The thing is that no compression is used at all. Both the outer and the inner archive are used purely as containers and are generated using CompressionLevel.NoCompression. The inner archive holds thousands of small entries and can be seen as a read-only container. The outer archive holds entries containing meta information about the entries of the inner archive.

When no compression is used at all, it is possible to randomly access all data in the nested archive by reading directly from the underlying FileStream (which is seekable), without a DeflateStream in between. That is how I solved this special case for now, but I had to implement my own custom ZipReader.
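
The custom reader itself is not shown in this thread. As an illustration only, the core of such an approach could look like the sketch below: it reads one entry with compression method 0 ("stored") straight from a seekable stream, given the absolute offset of its ZIP local file header. The names StoredZipEntry and Read are hypothetical, and the sketch assumes no Zip64 sizes, no data descriptor, no encryption, and UTF-8 or ASCII entry names. How the offset is obtained (for example from the inner archive's central directory plus the position of the inner archive within the outer file) is left out.

```csharp
using System;
using System.IO;
using System.Text;

static class StoredZipEntry
{
    private const uint LocalHeaderSignature = 0x04034b50; // "PK\x03\x04"

    // Hypothetical helper: reads the raw bytes of one stored (method 0) entry.
    public static byte[] Read(Stream stream, long localHeaderOffset, out string name)
    {
        stream.Seek(localHeaderOffset, SeekOrigin.Begin);
        using (var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
        {
            if (reader.ReadUInt32() != LocalHeaderSignature)
                throw new InvalidDataException("No local file header at the given offset.");

            reader.ReadUInt16();                      // version needed to extract
            ushort flags = reader.ReadUInt16();       // general purpose bit flags
            ushort method = reader.ReadUInt16();      // 0 = stored, 8 = deflate
            reader.ReadUInt32();                      // last modified time and date
            reader.ReadUInt32();                      // CRC-32 (not verified here)
            uint compressedSize = reader.ReadUInt32();
            reader.ReadUInt32();                      // uncompressed size
            ushort nameLength = reader.ReadUInt16();
            ushort extraLength = reader.ReadUInt16();

            if (method != 0)
                throw new NotSupportedException("Entry is not stored uncompressed.");
            if ((flags & 0x0008) != 0)
                throw new NotSupportedException("Sizes are in a data descriptor.");
            if (compressedSize == 0xFFFFFFFF)
                throw new NotSupportedException("Zip64 entries are not handled.");

            name = Encoding.UTF8.GetString(reader.ReadBytes(nameLength));
            stream.Seek(extraLength, SeekOrigin.Current); // skip the extra field

            byte[] data = reader.ReadBytes(checked((int)compressedSize));
            if (data.Length != compressedSize)
                throw new EndOfStreamException("Entry data is truncated.");
            return data;
        }
    }
}
```

Because only the header and the entry's own bytes are read, a single small entry costs a seek plus a short read on the network share rather than a pass over the whole 400 MB file.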
