Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WordprocessingDocument.Open is very slow #628

Closed
amanbolat opened this issue Sep 24, 2019 · 15 comments
Closed

WordprocessingDocument.Open is very slow #628

amanbolat opened this issue Sep 24, 2019 · 15 comments
Milestone

Comments

@amanbolat
Copy link

amanbolat commented Sep 24, 2019

Description

WordprocessingDocument.Open is very very slow when reading big .docx document.
i'm trying to read 10 mb sized .docx document and it takes about 1 minute just to open it.

Information

  • .NET Target: .NET Core 2.2
  • DocumentFormat.OpenXml Version: 2.9.0

Repro

Console.WriteLine("Creating filter")
using (var doc = WordprocessingDocument.Open(path, false))
{
        Console.WriteLine("Creating BodyReader");
        _bodyReader = OpenXmlReader.Create(doc.MainDocumentPart.Document);
}

Link to the file: https://drive.google.com/file/d/1_InQLbZ19KCUgkuePAiLXvUuLcZl6Qu7/view?usp=sharing

Uploaded to GitHub: 10mb_file.docx

Observed

I put to lines of Console.WriteLine so the time between "Creating filter" and "Creating BodyReader" is about 1 min. It doesn't matter if i opening file from memory stream or just giving it a real path to the file.

Expected

Instant open expected :)

@twsouthwick
Copy link
Member

Thanks for the issue. What OS are you running on?

@twsouthwick
Copy link
Member

twsouthwick commented Sep 28, 2019

Here's a benchmark version of it:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using DocumentFormat.OpenXml.Packaging;

namespace OpenSample
{
    [CoreJob]
    public class OpenBenchmark
    {
        [Benchmark]
        public WordprocessingDocument Open()
        {
            var path = ...;
            return WordprocessingDocument.Open(path, false);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<OpenBenchmark>();
        }
    }
}

And the results:

| Method |    Mean |    Error |   StdDev |
|------- |--------:|---------:|---------:|
|   Open | 21.56 s | 0.3080 s | 0.2730 s |

I agree that seems like a lot of time. I'll look into it.

@amanbolat
Copy link
Author

I tried it on windows 10 and MacOS. Results are same.

@amanbolat
Copy link
Author

I'm new to C# so i don't know how to profile it and find the problem, but after looking for source code it seems to me that Open function parse all dependencies in file. And this file in my example has a lot of nested charts and images. May be it will help you.

@twsouthwick
Copy link
Member

ah that is a good observation. I'll have to see if there's a way to lazily load those (I've never actually explored that scenario)

@twsouthwick
Copy link
Member

@amanbolat Sorry for the delay. I've got some time now to look into some of these issues. However, lazily loading this is trickier than I thought. Have you tried loading the package into memory and then accessing it that way? If streaming from disk, it will have some overhead from disk IO. You could try the following:

using(var ms = new MemoryStream())
{
  using(var fs = File.OpenRead("path to file"))
  {
    fs.CopyTo(ms);
  }
  ms.Position(0);

  // Open document from memory stream
}

@amanbolat
Copy link
Author

@twsouthwick Hi. Yes, I tried it and result is same, it will take about 1 minute. I tested on my new MacBook with ssd so the overhead shouldn’t be so big, I think so.

@twsouthwick
Copy link
Member

Ok. I'll try to make this a higher priority, but to set expectations, I work on this as a side project for work so I'm able to do feature work here somewhat sporadically. If you have any desire to take a stab at it, feel free and I'm happy to provide pointers

@amanbolat
Copy link
Author

Thank you.
I’m new to C# but I will try my best.

@ThomasBarnekow
Copy link
Collaborator

@amanbolat, try the following method from my CodeSnippets repository:

        public static MemoryStream ReadAllBytesToMemoryStream(string path)
        {
            byte[] buffer = File.ReadAllBytes(path);
            var destStream = new MemoryStream(buffer.Length);
            destStream.Write(buffer, 0, buffer.Length);
            destStream.Seek(0, SeekOrigin.Begin);
            return destStream;
        }

I've benchmarked a number of ways to clone Office Open XML documents and this came out as the fastest.

@amanbolat
Copy link
Author

@ThomasBarnekow method doesn't matter. Try to open my file and you will see that it will take about 1 or 2 minutes.

Link to the file: https://drive.google.com/file/d/1_InQLbZ19KCUgkuePAiLXvUuLcZl6Qu7/view?usp=sharing

@github-actions
Copy link

Stale issue message

@twsouthwick
Copy link
Member

I'm going to open as I recently made a change to System.IO.Packaging that may help this and I want to verify. See dotnet/runtime#35978

@twsouthwick twsouthwick modified the milestones: 2.11, 2.12 Jul 10, 2020
@twsouthwick
Copy link
Member

FYI the fix that I got into System.IO.Packaging fixes this!

Before:

| Method |    Mean |    Error |   StdDev |
|------- |--------:|---------:|---------:|
|   Open | 9.167 s | 0.1791 s | 0.2265 s |

After:

| Method |     Mean |   Error |  StdDev |
|------- |---------:|--------:|--------:|
|   Open | 184.5 ms | 3.64 ms | 4.85 ms |

Since the package is still in preview, we won't be upgrading it for the project at this time. Note, since the major version of this is changing, we won't actually be able to bring it into the repo until v3.0 (due to semantic versioning). That'll probably happen sometime soon, but not for a bit. However, you may manually bring in 5.0.0-preview.6.20305.6+ of the package to get the benefit (although it will only help you if you are on .NET Core)

@twsouthwick
Copy link
Member

I'm going to close this for now, as it is fixed where it needs to be. Once we move to v3.0, we'll update to the latest System.IO.Packaging so it'll come by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants