
System.IO.Compression: ZipArchiveEntry always stores uncompressed data in memory #1544

Open · Tracked by #62658
qmfrederik opened this issue Sep 13, 2016 · 38 comments

Labels: api-suggestion (Early API idea and discussion, it is NOT ready for implementation), area-System.IO.Compression, needs-further-triage (Issue has been initially triaged, but needs deeper consideration or reconsideration)

@qmfrederik
Contributor

If a ZipArchive is opened in Update mode, calling ZipArchiveEntry.Open will always result in the entry's data being decompressed and stored in memory.

This is because ZipArchiveEntry.Open calls ZipArchiveEntry.OpenInUpdateMode, which in turn reads the ZipArchiveEntry.UncompressedData property, decompressing the data and storing it in a MemoryStream.

This in itself is already fairly inconvenient: if I'm updating a large file (say 1 GB) that's compressed in a zip archive, I'd want to read it from the ZipArchiveEntry in smaller chunks, save it to a temporary file, and update the entry in the ZipArchive the same way (i.e., limiting memory overhead by preferring temporary files).

This also means that as soon as a ZipArchive is opened in Update mode, even reading a ZipArchiveEntry that you'll never update incurs this extra cost.

A short-term fix may be to expose ZipArchiveEntry.OpenInReadMode, which will return a DeflateStream instead. If you're doing mixed reading/writing of entries in a single ZipArchive, this should already help you avoid some of the memory overhead.
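
To make the cost concrete, here's a minimal sketch of the current behavior (file and entry names are placeholders):

using System.IO;
using System.IO.Compression;

// Opening the archive in Update mode: any call to entry.Open() routes
// through OpenInUpdateMode and decompresses the whole entry into memory.
using var archive = ZipFile.Open("large.zip", ZipArchiveMode.Update);
ZipArchiveEntry entry = archive.GetEntry("big-file.bin");

// Even if we only ever read, the full uncompressed data (possibly 1 GB+)
// is already sitting in a MemoryStream by the time Open returns.
using Stream stream = entry.Open();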

@ericstj
Member

ericstj commented Feb 27, 2019

I've noticed the same. In general ZipArchiveEntry.Open is very non-intuitive in its behavior.

For read-only / write-only you get a wrapped DeflateStream that neither reports the stream's length nor permits seeking. For read/write (update), ZipArchiveEntry will read and decompress the entire entry into memory (in fact, into a MemoryStream backed by a single contiguous managed array) so that you have a seekable representation of the file. Once opened for update, the file is then written back to the archive when the archive itself is closed.

I agree with @qmfrederik here that we need a better API. Rather than relying solely on the archive's open mode, we should allow individual calls to Open to specify what kind of stream they want. We can then check that against how the archive was opened and throw if it's incompatible. Consider the addition:

  public Stream Open(FileAccess desiredAccess)

For an archive opened with ZipArchiveMode.Update we could allow FileAccess.Read, FileAccess.Write, or FileAccess.ReadWrite, where only the latter would do the MemoryStream expansion. Read and write would behave as they do today. In addition to solving the OP issue, this would address the case where someone opens an archive for Update and simply adds a single file: we shouldn't need to copy that uncompressed data into memory just to write it to the archive.

We also can do better in the read case: we know the length of the data (assuming the archive is not inconsistent) and can expose that rather than throwing.
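
A usage sketch of the proposed overload (hypothetical: Open(FileAccess) is only a suggestion at this point, not an existing API):

using var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update);

// Streamed read with no MemoryStream expansion (proposed behavior);
// Length could be reported from the central directory record.
using (Stream read = archive.GetEntry("existing.txt").Open(FileAccess.Read))
{
    // read in chunks
}

// Streamed write of a brand-new entry, compressed on the fly (proposed).
using (Stream write = archive.CreateEntry("added.bin").Open(FileAccess.Write))
{
    // write in chunks
}

// Only this call would pay the full in-memory expansion cost.
using Stream update = archive.GetEntry("existing.txt").Open(FileAccess.ReadWrite);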

@ericstj
Member

ericstj commented Feb 27, 2019

Another interesting consideration for this is something like the approach taken by System.IO.Packaging in desktop. It implemented an abstraction over the deflate stream that would change modes depending on how you interacted with it: https://referencesource.microsoft.com/#WindowsBase/Base/MS/Internal/IO/Packaging/CompressStream.cs,e0a52fedb240c2b8

Exclusive reads with small seeks would operate on a deflate stream; the same goes for exclusive writes. Large seeks or random access would fall back to an "emulation mode" in which it decompressed everything to a stream that was partially backed by memory but would spill to disk.

I don't really like this since it hides some very expensive operations behind synchronous calls, and it introduces a potential trust boundary (a temp file) behind something that is expected to be purely computational. I think it makes sense to keep Zip lower-level and not try to hide this in the streams we return. Perhaps we could allow the caller to provide a stream for temporary storage in the Update case.

@carlossanlop
Member

Triage:
This would be nice to have.

@carlossanlop carlossanlop transferred this issue from dotnet/corefx Jan 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added this to the 5.0 milestone Jan 9, 2020
@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jun 18, 2020
@twsouthwick
Member

twsouthwick commented Jul 24, 2020

@carlossanlop This is blocking many users from moving to .NET Core, as writing large Office files ends up hitting this by way of System.IO.Packaging -> DocumentFormat.OpenXml.

@danmoseley
Member

Maybe we should mark this 6.0.0

@Trysor

Trysor commented Aug 20, 2020

@twsouthwick @danmosemsft Is there a way to work around this issue (for OpenXml or otherwise) as a temporary solution? Alternatively, could it be re-discussed for 5.0.0?

@danmoseley
Member

I do not have context on this area; that would be @twsouthwick and @carlossanlop.

@ericstj
Member

ericstj commented Sep 3, 2020

Typically you can work around this if you open an archive only for create, or only for read. When you open an archive for update, our Zip implementation needs to buffer things to memory since it can't mutate in place. I agree that we should try to fix something here in 6.0.0.
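
For example, a sketch of that workaround: "update" by streaming entries from a read-only archive into a freshly created one (paths are placeholders):

using System.IO;
using System.IO.Compression;

using var source = ZipFile.OpenRead("input.zip");
using var dest = ZipFile.Open("output.zip", ZipArchiveMode.Create);

foreach (ZipArchiveEntry entry in source.Entries)
{
    // Both streams are deflate wrappers; the data flows through in chunks,
    // so memory use stays bounded regardless of entry size.
    ZipArchiveEntry copy = dest.CreateEntry(entry.FullName);
    using Stream input = entry.Open();
    using Stream output = copy.Open();
    input.CopyTo(output);
}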

@nike61

nike61 commented Mar 26, 2021

Hello everyone, are there any updates on that issue? We really want that fix 😄

@ericstj
Member

ericstj commented Mar 26, 2021

@nike61 did the suggested workarounds not work for you? Maybe share some of your scenario to help move things along.

Would the suggested API of providing Read | Write granularity on individual entries solve this for you, or do you need a solution that lets you edit the contents of an entry?

@nike61

nike61 commented Mar 28, 2021

@ericstj we use OpenXML to generate large Excel documents. I'm not sure; maybe we should ask the OpenXML team to use the workaround.

@ericstj
Member

ericstj commented Mar 29, 2021

I see, so that'd probably be @twsouthwick, who appears to be the current maintainer. @twsouthwick does OpenXML expose the workarounds suggested above (opening ReadOnly or Create, but not Update)? If ZipArchiveEntry had read-only and write-only Open APIs on entries, would this be good enough?

@twsouthwick
Member

@ericstj AFAIK, OpenXML never directly deals with any ZipArchiveEntry. That's handled via System.IO.Packaging.Package. Here's a repro of the same memory growth users are seeing without the OpenXml layer:

using System;
using System.IO;
using System.IO.Packaging;

namespace ClassLibrary2
{
    public class Class1
    {
        public static void Main()
        {
            var filePath = Path.GetTempFileName();

            // FileAccess.ReadWrite maps to ZipArchiveMode.Update internally,
            // so part streams end up buffered in memory.
            using var package = Package.Open(filePath, FileMode.Create, FileAccess.ReadWrite);

            var part = package.CreatePart(new Uri("/test", UriKind.Relative), "something/other");

            using var stream = part.GetStream(FileMode.Create);

            // Memory grows with every write: the part's contents accumulate
            // in a MemoryStream until the package is closed.
            for (int i = 0; i < int.MaxValue; i++)
            {
                var bytes = BitConverter.GetBytes(i);
                stream.Write(bytes, 0, bytes.Length);
            }
        }
    }
}

As you can see, everything is using FileMode.Create. I believe all writing to a given part's stream replaces the full contents, so we never need to update an entry in place.

@ericstj
Member

ericstj commented Mar 29, 2021

That's happening because of FileAccess.ReadWrite in the above. Here's what it's translated to:

if (packageFileAccess == FileAccess.Read)
    zipArchiveMode = ZipArchiveMode.Read;
else if (packageFileAccess == FileAccess.Write)
    zipArchiveMode = ZipArchiveMode.Create;
else if (packageFileAccess == FileAccess.ReadWrite)
    zipArchiveMode = ZipArchiveMode.Update;

If you are only writing, just use FileAccess.Write; that should work around the entry-buffering issue.
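
Applied to the repro above, the change is a single argument (a sketch, assuming the scenario really is write-only):

// FileAccess.Write maps to ZipArchiveMode.Create, so part contents are
// compressed as they are written instead of buffered in a MemoryStream.
using var package = Package.Open(filePath, FileMode.Create, FileAccess.Write);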

In the repro sample above, the call to GetStream eventually ignores the FileMode passed in.

So if we created read-only/write-only Open APIs on ZipArchiveEntry, then System.IO.Packaging could be modified to use them, though this would be breaking for people who counted on getting the buffered/seekable stream when opening a package in ReadWrite mode.

A bit of history here: the System.IO.Packaging library was originally ported by a previous maintainer of OpenXML, but since WPF came to .NET Core in 3.0 it's used there as well, so we need to be mindful of those scenarios when changing this.

@jeffhandley jeffhandley added this to the 8.0.0 milestone Jul 9, 2022
@easuter

easuter commented Nov 18, 2022

I appreciate that this will be a complex issue to fix; however, it has been open for more than six years now.
Is the .NET Framework version unaffected because it relies on some Windows API? If so, why can't this be ported to .NET/Core?

@petrkoutnycz

No workarounds? No fix? :-(

@LarinLive

I have changed a few jobs since opening this issue (dotnet/Open-XML-SDK#244); five years have passed, but there is no solution yet. The core thing is that the old .NET Framework worked and still works well, but not the new, stylish, modern .NET Core. Any ideas, Microsoft?

@twsouthwick
Member

I finally got around to attempting a workaround on the OpenXml side. Here's what I did:

  1. Added a new abstraction on top of the packaging APIs because the packaging APIs are not great (you cannot compose new implementations together - the public methods are not the same as the protected virtual methods)
  2. Enabled ability to "reload" a package - now a new package instance can be used underneath if needed without the rest of the stack knowing that
  3. Created a temporary storage of part streams that (currently) are written to files that will be tracked
  4. On "Save()", I reload the package to just be FileAccess.Write
  5. If I try to get a part to save, I'm no longer able to access it because I'm in write-only mode and can't read the parts

So, a write-only API on PackagePart that enables a mode in which the data is not buffered to memory would be greatly appreciated.

As an alternative, what about providing an option to supply the temporary buffer used here, rather than always using a MemoryStream?

@clement911

We're struggling with this as well.
Can it get some love?

@LarinLive

This problem is a major obstacle to stream-writing large Excel files with the Open-XML-SDK library. The corresponding bug, dotnet/Open-XML-SDK#244, has been open for six years. IMHO, there has been enough time to offer a solution.

@PaulusParssinen
Contributor

Just FYI for subscribers to this issue: I opened an API proposal for the quoted API at #101243.

I've noticed the same. In general ZipArchiveEntry.Open is very non-intuitive in its behavior.

For read-only / write-only you get a wrapped DeflateStream that neither reports the stream's length nor permits seeking. For read/write (update), ZipArchiveEntry will read and decompress the entire entry into memory (in fact, into a MemoryStream backed by a single contiguous managed array) so that you have a seekable representation of the file. Once opened for update, the file is then written back to the archive when the archive itself is closed.

I agree with @qmfrederik here that we need a better API. Rather than relying solely on the archive's open mode, we should allow individual calls to Open to specify what kind of stream they want. We can then check that against how the archive was opened and throw if it's incompatible. Consider the addition:

  public Stream Open(FileAccess desiredAccess)

For an archive opened with ZipArchiveMode.Update we could allow FileAccess.Read, FileAccess.Write, or FileAccess.ReadWrite, where only the latter would do the MemoryStream expansion. Read and write would behave as they do today. In addition to solving the OP issue, this would address the case where someone opens an archive for Update and simply adds a single file: we shouldn't need to copy that uncompressed data into memory just to write it to the archive.

We also can do better in the read case: we know the length of the data (assuming the archive is not inconsistent) and can expose that rather than throwing.

@LarinLive

@PaulusParssinen you suggested a nice approach. @twsouthwick, can the .NET team triage it to move forward on the original problem?

@kotofsky

Voting for fixing this!

@kotofsky

I see that there is no fix yet, so I made my own library for big Excel and Word documents. No leaks, and it works fast. But you still need to stay on OpenXml v2.19.0.
Hope it helps someone.
https://www.nuget.org/packages/DocMaker

@udlose

udlose commented Oct 11, 2024

Is there any update on this bug?

@robert94p

Is the fix missing in .NET 9? Do we still have to rely on an older version of the Open XML SDK?

@AntonVonDelta

We really need this fix. We currently have issues writing some xlsx files because of this.

@easuter

easuter commented Dec 6, 2024

Exiting my Sleep-Easy Pod©™®, I stretch and then head to the bathroom to take a quick sonic shower. The pod whirs softly as it retracts into the wall, with my Flexi Work Systems©™® smartdesk taking its place.

"Cortana, what task have I been assigned? New sprint starts today I think."
"Good morning Dave, you have been tasked with producing an Excel report for the Microsoft-Monsanto Agricultural Combine. I'll send you the details when you're at your workstation."
"Excel?? Are you sure?"
"Yes, Dave."
"When is it due?"
"On December 7th, 2099"
"Great, that's my weekend gone. Wait...why weren't you given this task? You'd have it done in no time!"

Donning my clothes and grabbing a bowl of Cheerioches©™® with Soyoat Mylk©™®, I sit down at my desk.

"An unexpected error occurred and I was unable to store the report in the requested format."
"Phew, OK. Doesn't sound too hard. Ping me the job ID so I can figure out what went wrong."
"Sent. I'd be happy to help you fix the issue."
"You know I can't give you access to your core systems, Cortana."
"Just trying to help. Let me know if you need anything else, Dave."
"Right..."

Can't believe The Combine are still using Excel workbooks to exchange data internally. Oh well, it's not something I can change.
OK, so it looks like the job exceeded its 512 Exabyte memory quota...and the root cause is The Framework of all things..."System.IO.Packaging.Package".

A sinking feeling sets in as I review search results returned by Bing Apex Enterprise Edition©™®.

clicks #26084

"Oh no..."

scroll scroll scroll

"OH NO NO NO..."

Last comment reads:

uwusenpai2309023478 on November 5 2099

Posting to vote on this! Really need this fix 🙏🙏🙏

@ericstj
Member

ericstj commented Dec 6, 2024

Yeah, now's a good time to have another look at this. I think an API that allows differing read/write/update behavior per entry, as described above and in #101243, is reasonable and can help. I'd add to that an overload which can accept a stream to use for temporary storage of the decompressed bits.

Those seem like reasonable ways to support this scenario without introducing "hidden disk access". I'd like to hear from @carlossanlop @edwardneal @twsouthwick if they think those are viable solutions and would also be adopted by Open-XML if we built them.
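
To make the temporary-storage idea concrete, here is one hypothetical shape; no such overload exists today, and the signature and usage below are illustrative only (assume entry is a ZipArchiveEntry from an archive opened for Update):

// Hypothetical API: the caller supplies the scratch stream that backs the
// decompressed entry while it is open for update.
public Stream Open(FileAccess desiredAccess, Stream temporaryStorage);

// Hypothetical usage: spill the update buffer to a temp file, not memory.
using var scratch = new FileStream(Path.GetTempFileName(), FileMode.Open,
    FileAccess.ReadWrite, FileShare.None, 4096, FileOptions.DeleteOnClose);
using Stream entryStream = entry.Open(FileAccess.ReadWrite, scratch);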

@edwardneal
Contributor

Thanks @ericstj. Being able to open a ZipArchiveEntry in read-only mode would definitely help, but there's another problem behind it: when disposed of, a ZipArchive opened in Update mode will currently rewrite the Stream it was loaded from by loading every ZipArchiveEntry into memory and writing them all back out. Issue #1543 tracks this; PR #102704 is currently going through the review process and will fix this specific use case.

The idea of a secondary buffer stream is interesting. The WPF repo currently uses a GetSeekableStream method to work around the lack of seekable streams, and this would at least give them some degree of control over that. dotnet/wpf#2085 may be helpful here.

Besides this, ZipArchiveEntry.Open only allows a single writer stream. That restriction might not be necessary depending on how we use a buffer stream, so we might be able to permit multiple simultaneous read-only streams.

@twsouthwick
Member

This would be great for the dotnet/open-xml-sdk. It is the main blocker for migration from .NET Framework->Core for those workloads.

We generally don't operate directly with the ZipArchiveEntry, but rather through System.IO.Packaging. If that gets updated, then things should just work for us. If a new API is added there to work with this new API, we can update to use that as well.
