Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large parts cannot be written on .NET Core due to OutOfMemoryException #807

Closed
Viktor-36 opened this issue Sep 16, 2020 · 29 comments
Closed

Comments

@Viktor-36
Copy link

Viktor-36 commented Sep 16, 2020

Description

I was using ClosedXML to create large excel files (100k rows with 63 colums) and I faced issue with OutOfMemoryException. I found some examples of OpenXML using SAX, so I tried to switch to OpenXML, but it didn't help me. I tried to remove all my code (reading from DB, etc.) to try if that works, but I still get OutOfMemoryException.

My code is based on http://polymathprogrammer.com/2012/08/06/how-to-properly-use-openxmlwriter-to-write-large-excel-files/

Information

  • .NET Target: .NET Core 3.1
  • DocumentFormat.OpenXml Version: 2.11.3

EDIT: Same code is working fine in .NET Framework 4.7.2 with same DocumentFormat.OpenXml version.

Repro
This is simple code that I use at the moment. I am testing with RAM limited to 500 MB (testing purpose). I don't think that this code can consume so much RAM.

using (SpreadsheetDocument document = SpreadsheetDocument.Create(filePath, SpreadsheetDocumentType.Workbook))
{
    document.AddWorkbookPart();

    WorksheetPart wsp = document.WorkbookPart.AddNewPart<WorksheetPart>();

    using (OpenXmlWriter writer = OpenXmlWriter.Create(wsp))
    {
        List<OpenXmlAttribute> oxa;
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());

        for (int i = 0; i < 100000; i++)
        {
            oxa = new List<OpenXmlAttribute>();
            oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));
            writer.WriteStartElement(new Row(), oxa);

            for (int j = 0; j < 40; j++)
            {
                oxa = new List<OpenXmlAttribute>();
                oxa.Add(new OpenXmlAttribute("t", null, "str"));
                writer.WriteStartElement(new Cell(), oxa);
                writer.WriteElement(new CellValue("test"));
                writer.WriteEndElement();
            }

            writer.WriteEndElement();
        }

        writer.WriteEndElement(); // end of sheetdata
        writer.WriteEndElement(); //end of worksheet
    }

    using (OpenXmlWriter writer = OpenXmlWriter.Create(document.WorkbookPart))
    {
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());

        writer.WriteElement(new Sheet() { Id = document.WorkbookPart.GetIdOfPart(wsp), SheetId = 1, Name = "Test" });

        writer.WriteEndElement();
        writer.WriteEndElement();
    }
}

Observed

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.IO.MemoryStream.set_Capacity(Int32 value)
   at System.IO.MemoryStream.EnsureCapacity(Int32 value)
   at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.Xml.XmlUtf8RawTextWriter.FlushBuffer()
   at System.Xml.XmlUtf8RawTextWriter.RawText(Char* pSrcBegin, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.RawText(String s)
   at System.Xml.XmlUtf8RawTextWriter.WriteEndElement(String prefix, String localName, String ns)
   at System.Xml.XmlWellFormedWriter.WriteEndElement()
   at DocumentFormat.OpenXml.OpenXmlPartWriter.WriteEndElement()

Expected

Excel file in filePath with 100k rows and 40 columns with string "test" in all cells.

@Viktor-36 Viktor-36 changed the title OutOfMemoryException using SAX OutOfMemoryException using SAX in .NET Core Sep 17, 2020
@Viktor-36
Copy link
Author

Viktor-36 commented Sep 17, 2020

I closed the issue by mistake, so I am reopening it... I was trying to find more information about this issue and I found some issues with System.IO.Packaging... Is there some workaround to temporarily fix my problem? I really need to generate large excel files in .NET Core.

Related issues:
dotnet/runtime#23750
dotnet/runtime#1544

@Viktor-36 Viktor-36 reopened this Sep 17, 2020
@twsouthwick
Copy link
Member

This is definitely dotnet/runtime#1544. I am unsure of a good workaround at this time.

@rsdelapaz
Copy link

I am having the same problem. Has someone found a workaround for this issue?

Thank you in advance

@nike61
Copy link

nike61 commented Mar 28, 2021

Hello @twsouthwick. Did you see the workaround that was supposed in dotnet/runtime#1544 discussion?

@rsdelapaz
Copy link

Hi @Viktor-36, how have you fixed or workaround this issue?I need to generate large excel files in .NET Core too.

Hi @twsouthwick, have you read the message from @nike61?

@clement911
Copy link

We are also running into that exact same issue, even though we call SpreadsheetDocument.Create with a FileStream.
Our environment is .Net 5 on Azure App service.

@sorensenmatias
Copy link
Contributor

Our company is currently being hit hard by this issue for both WordProcessing, Presentation and Spreadsheet. We are seeing big spikes of memory consumption on our production environment when modifying the OpenXml of certain documents. This gives significant problems for our enterprise customers.
Any updates or workarounds would be most appreciated.

@blakepell
Copy link

blakepell commented May 28, 2021

@sorensenmatias @clement911 @rsdelapaz

Sorry for the spam, I got something working and thought perhaps it might help someone looking for a workaround. I don't love my solution but, it did work. Since I wasn't able to write to a second OpenXmlWriter I did this (order is important):

  1. Create a FileStream (I used File.Create).
  2. Create a Package, pass in the FileStream and use FileMode.Create and FileAccess.Write
  3. Create a SpreadsheetDocument via SpreadsheetDocument.Create
  4. Write your large WorksheetPart via an OpenXmlWriter
  5. Close and Dispose of the writer, the package, the file stream, etc.
  6. Create a FileStream (open this time, File.Open with FileMode.Open, FileAccess.ReadWrite and FileShare.None)
  7. Create a Package, pass in the FileStream and use FileMode.Open and FileAccess.ReadWrite
  8. Create a SpreadsheetDocument via SpreadsheetDocument.Open
  9. Create an OpenXmlWriter for the WorkbookPart and add the elements for Workbook and Sheets, then you'll associate the Sheet you added on the original create, close and dispose of those objects and done.

What I found was I was only able to use FileAccess.Write on the first dataset I wrote to the spreadsheet, if I tried to write anything else it would throw an exception (where ReadWrite did not, so I can only assume once I start a second writer the OpenXml library needs to read something). That's why I had to two sets of operations (the ReadWrite part at the end is very small, so no memory concerns there as it doesn't uncompress into memory the large sheet).

Outside of the acrobatics, the limitation of this approach seems to be that you can only have 1 large sheet (but from the examples I've seen, many people are only writing one large sheet per spreadsheet so this might help someone).

@blakepell
Copy link

Here's more of a working sample for .NET Core that should get people started on the workaround. I just wrote out a 500,000 row dataset and the memory footprint stayed pretty low. There are a few places where some extension methods are used (like SafeLeft), you can remove those and put in what you need (GetCell in particular isn't super clean, keep in mind, proof of concept). What you'll be interested really is the order of the ToFile static method and it follows the outline in the bulleted list on my last comment.

The biggest limitation I can see is that you can only write one large sheet to the document (after that ReadWrite is required, I could never get a second Write only stream to work).

https://gist.github.com/blakepell/8fe938624f1dad8c28ff93a334687d77

@sorensenmatias
Copy link
Contributor

Thanks @blakepell, this is very helpful. Unfortunately, doing Write instead of Read/Write would require a huge refactoring (months) for our use case. So our hopes is still that the root issue here could be fixed.
@twsouthwick is there any news on the matter?

@blakepell
Copy link

One other note, with the workaround approach I posted you can add additional sheets later in ReadWrite, they just can't be SAX based (as far as I can see). I just tested adding a second sheet with a few rows in the traditional way after the initial large dataset and the memory footprint only grew as large as that second sheet I added (again, that requires re-opening the excel document in ReadWrite).

@j2jensen
Copy link

@blakepell 's solution gave us an error ("Cannot retrieve parts of writeonly container") until we updated to the latest version of the System.IO.Packaging NuGet package.

@M4urici0GM
Copy link

Any Updates on that?

@BNarayanSharma
Copy link

BNarayanSharma commented Jan 10, 2023

Is there any update on when it will be fixed? My .NET Core Migration for a few projects is on hold for the last two years due to this issue.

I need to generate a large excel file, and it throws the exception "Stream was too long"

@M4urici0GM
Copy link

M4urici0GM commented Jan 10, 2023

Well, i was able to fix it, but i don't remember what i did, what i can say, is that back then, i was able to fix this issue doing something like this:

for some reason, if you first write the data to a WorksheetPart, close the stream, and then write the Sheet attributing the WorksheetPart to it, it runs a lot smoother

Note that im using System.IO.Packaging there, as far as i read (and remember) there was some issues with OpenXML internal Packaging on .net5 to 6, and some people show me this workaround using the native System.IO.Packaging, we need to verify that's still broken on .net 7 though

Doing this, i was able to generate aprox 1M rows excel using 1-2GiB of ram, with a high CPU usage though, but a LOT faster, but also note that im using a IAsyncEnumerable in order to stream data to the file, i was using RabbitMq for streaming batches of 1000 lines each message, so its a good idea of streaming it from the database as well
Here's the code i've used on the past, see if works for you, feel free to do your modifications, and let me know if you need something

 /// <summary>
  ///   Writes .xlsx file to a given path using the provided data stream.
  /// </summary>
  /// <param name="filePath"></param>
  /// <param name="columns"></param>
  /// <param name="dataStream"></param>
  public async ValueTask GenerateFile(
    string filePath,
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream)
  {
    var worksheetPartId = await WriteInitialDocument(filePath, columns, dataStream);
    await WriteSheetToDocumentAsync(filePath, worksheetPartId);
  }

 private async Task<string> WriteInitialDocument(
    string filePath,
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream)
  {
    await using var fileStream = File.Create(filePath);
    using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write);
    using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook);

    excel.AddWorkbookPart();
    ApplyStyles(excel);

    var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>();
    var workbookId = excel.WorkbookPart.GetIdOfPart(worksheetPart);

    await WriteStreamDataToFileAsync(columns, dataStream, worksheetPart);

    return workbookId;
  }
  
  /// <summary>
  ///   Finishes the process of writing data files.
  /// </summary>
  /// <param name="filePath"></param>
  /// <param name="worksheetPartId"></param>
  private static async Task WriteSheetToDocumentAsync(string filePath, string worksheetPartId)
  {
    await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
    using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite);
    using var excel = SpreadsheetDocument.Open(package);

    if (excel.WorkbookPart is null)
      throw new InvalidOperationException("Workbook part cannot be null!");

    var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart);
    xmlWriter.WriteStartElement(new Workbook());
    xmlWriter.WriteStartElement(new Sheets());

    xmlWriter.WriteElement(new Sheet {Id = worksheetPartId, Name = "Sheet 1", SheetId = 1});

    // For Sheets
    xmlWriter.WriteEndElement();

    // For Workbook
    xmlWriter.WriteEndElement();

    xmlWriter.Close();
    xmlWriter.Dispose();
  }

 /// <summary>
  ///   Streams and writes data to file.
  /// </summary>
  /// <param name="columns"></param>
  /// <param name="dataStream"></param>
  /// <param name="worksheetPart"></param>
  private async Task WriteStreamDataToFileAsync(
    IDictionary<string, string> columns,
    IAsyncEnumerable<IEnumerable<T>> dataStream,
    OpenXmlPart worksheetPart)
  {
    using var xmlWriter = OpenXmlWriter.Create(worksheetPart);

    xmlWriter.WriteStartElement(new Worksheet());
    xmlWriter.WriteStartElement(new SheetData());

    var rowIndex = 1;
    var attributes = new List<OpenXmlAttribute> {new OpenXmlAttribute("r", string.Empty, rowIndex.ToString())};

    xmlWriter.WriteStartElement(new Row(), attributes);

    var columnList = columns.Values.ToList();
    foreach (var column in columnList)
    {
      var cellValue = column.StripHexadecimalSymbols();
      var cell = new Cell {CellValue = new CellValue(cellValue), StyleIndex = 1, DataType = CellValues.String};

      xmlWriter.WriteElement(cell);
    }

    // For Header Row
    xmlWriter.WriteEndElement();

    var columnKeysList = columns.Keys.ToList();
    await foreach (var itemEnumerable in dataStream)
    {
      var itemList = itemEnumerable.ToList();
      foreach (var t in itemList)
      {
        rowIndex += 1;

        // Starts a new row.
        xmlWriter.WriteStartElement(
          new Row(),
          new List<OpenXmlAttribute> {new("r", string.Empty, rowIndex.ToString())});

        foreach (var columnKey in columnKeysList)
        {
          var cellValue = _cellFormatterStrategy.GetCellValue(ReportFormats.Xlsx, t, columnKey);
          var cell = _cellFormatterStrategy.GetXlsxCell(columnKey, cellValue);

          xmlWriter.WriteElement(cell);
        }

        // Closes the row.
        xmlWriter.WriteEndElement();
      }
    }

    // For SheetData
    xmlWriter.WriteEndElement();

    // For Worksheet
    xmlWriter.WriteEndElement();
  }

@twsouthwick
Copy link
Member

twsouthwick commented Jan 10, 2023

I'd like to get this fixed (at least a work around) for v3.0. I've created a set of abstractions that allow for more control over things and I think we could automate some of the work arounds here (at least in an opt-in way). For the abstractions, see: #1295.

My thoughts would be to model what @M4urici0GM did, but in a more transparent way. Of course, it would be better to have this fixed in the underlying package model, but that hasn't gone anywhere in too many years.

My initial thoughts to implementing this would be:

(1) Provide an abstraction of IPackage that would intercept calls to GetStream and write them to some temporary location
(2) On save, first, save the package as normal
(3) Then reopen the package in just write mode (this should allow the replacing of things without the explosion of memory)
(4) write the streams from the temporary location
(5) Close the package again and reopen with original mode/access

The abstractions I have should allow building this, except we'd need a way to "Reload" the underlying package. Building off of the abstractions, I'm thinking of enabling the following:

public interface IPackageFeature
{
  IPackage Package { get; }

+ bool CanReload { get; }

+ void Reload();
}

This could automatically be supported for files opened with paths or streams, but if a package is given, then it would not be supported (since we didn't manage the package) without additional information from a user.

@twsouthwick twsouthwick added this to the v3.0 milestone Jan 10, 2023
@twsouthwick twsouthwick changed the title OutOfMemoryException using SAX in .NET Core Large parts cannot be written on .NET Core due to OutOfMemoryException Jan 10, 2023
@twsouthwick twsouthwick self-assigned this Feb 3, 2023
@pre-alpha-final
Copy link

pre-alpha-final commented Mar 2, 2023

I made a hacky solution for this. Obviously not production-ready, but the idea itself, I think, is good. Instead of creating the Cell object from the start, we create it on the fly and drop it right after use. Made a poc repo for this:
https://github.com/pre-alpha-final/openxml-memory-usage-hack
The results are promising. I went from 6.1GB of RAM to 2.1GB out of which around 300MB+ is just the data itself.

EDIT: Run it with @Viktor-36 's parameters form original post (100k rows, 40 cells each, with text "test"). The standard save used up 2GB or RAM, the hacky save used 321MB.

@pre-alpha-final
Copy link

pre-alpha-final commented Mar 9, 2023

Created branches with support for the common data types and one with has the Cell creation method as a Func. This way you're not stuck with default mapping, making this less of a hack and more of a valid workaround.

@twsouthwick twsouthwick modified the milestones: v3.0, v3.1 Aug 3, 2023
@brdrlx
Copy link

brdrlx commented Aug 28, 2023

If anyone is facing the some problem nowadays - the workaround by @blakepell doesn't work in newer releases.
After some researching I found that it works with v2.16.0 (at least) - memory usage in my case (50k records with 30 column) dropped from 100Mb to 5Mb. Hope v3.1 will be able to solve that problem as well.

@RamyaVempalla
Copy link

RamyaVempalla commented Oct 18, 2023

@twsouthwick Is this issue will be resolved in v3.0? When can we expect v3.0? We tested with beta v3, still this issue exists

@twsouthwick
Copy link
Member

Unfortunately, I wasn't able to get it to work and had other priorities. I'll move this to v3.1 to see what can be done

@clauderichardgeotab
Copy link

I was wondering the reason why blakepell's workaround can't work anymore in OpenXml 2.20 and up? It seems a validation check was added preventing the use of SpreadsheetDocument.Create(some package with write-only access, SpreadsheetDocumentType.Workbook).

I tried a piece of blakepell's workaround but I came across a validation check that throws if the zip package has no read access. In v2.20.0: https://github.com/dotnet/Open-XML-SDK/blob/release/v2.20.0/src/DocumentFormat.OpenXml/Packaging/OpenXmlPackage.cs:

if (package.FileOpenAccess == FileAccess.Write)
{
    throw new OpenXmlPackageException(ExceptionMessages.PackageMustCanBeRead);
}

Or if I try in OpenXml 3.0.0, Package.cs checks and throws with ThrowIfWriteOnly:

private PackageRelationshipCollection GetRelationshipsHelper(string filterString)
    {
      this.ThrowIfObjectDisposed();
      this.ThrowIfWriteOnly();
      this.EnsureRelationships();
      return new PackageRelationshipCollection(this._relationships, filterString);
    }

Why was this validation check added when it wasn't there in previous versions? Would removing that check in newer versions of OpenXml cause other issues?

@clauderichardgeotab
Copy link

@carlossanlop Would you have any thoughts on my comment above? #807 (comment)

This seems related to the following dotnet issues that have been open for a long time.
dotnet/runtime#1543
dotnet/runtime#1544
Do you have any update on these issues and when these can move forward?

These have been causing memory usage issues with MemoryStream whenever we generate a large Excel file using OpenXml, that in turn uses System.IO.Packaging (which gets down to System.IO.Compression, method OpenInUpdateMode in ZipArchiveEntry.cs which keeps uncompressed data in a MemoryStream), and we're hoping this might be resolved on the .NET side.

@kotofsky
Copy link

.Net team still haven't resolved this issue. Is there any workaround or something?

@M4urici0GM
Copy link

@kotofsky the only workaround (i know so far, is the one i've published above)

@kolvahaa
Copy link

@blakepell I've expanded your workaround.

Just generate worksheetPartId and create WorksheetPart using it. There are also no restrictions on the number of document sheets with this approach

using (var package = Package.Open(stream, FileMode.Create, FileAccess.Write))
{
    using (var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook))
    {
        WorkbookPart wbp = excel.AddWorkbookPart();

        string worksheetPartId = "W1";

        using (var oxw = OpenXmlWriter.Create(wbp))
        {
            oxw.WriteStartElement(new Workbook());
            oxw.WriteStartElement(new Sheets());

            oxw.WriteElement(new Sheet()
            {
                Name = "test",
                SheetId = 1,
                Id = worksheetPartId
            });

            oxw.WriteEndElement();
            oxw.WriteEndElement();
        }

        var wsp = wbp.AddNewPart<WorksheetPart>(worksheetPartId);
        using (var oxw = OpenXmlWriter.Create(wsp))
        {
            oxw.WriteStartElement(new Worksheet());
            oxw.WriteStartElement(new SheetData());

            foreach (var row in rows)
            {
                // write rows
            }

            oxw.WriteEndElement();
            oxw.WriteEndElement();
        }
    }
}

Sorry, for text translation i used Google translator))))

@robert94p
Copy link

robert94p commented Aug 13, 2024

To get @blakepell solution to work, is it necessary to use only the older version 2.16.0? @twsouthwick Will this issue be resolved in version 3.2.0?

@AlfredHellstern
Copy link
Collaborator

@AlfredHellstern
Copy link
Collaborator

closing as duplicate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.