System.IO.Packaging.InternalRelationshipCollection.GetRelationshipIndex doesn't scale #983
Comments
@carlossanlop would we accept a PR here? |
Thank you for bringing this to our attention, @danmosemsft. Here are my notes about this request:
@mabrahamsen, a couple of requests for you:
|
Hi @carlossanlop, I have extracted a simple sample and attached it here. This will generate an Excel sheet using the Microsoft OpenXML SDK, and I have code with or without the hyperlink. You will see how the hyperlink version doesn't scale if you change the row counts. With the current row count a profiler will show you that about 98% of the time is spent calling InternalRelationshipCollection.Add to add a hyperlink per row.
Also, when I mentioned the HashSet, it was more to point towards a collection lookup strategy that avoids looping over the internal list. I was not recommending changing the type itself to implement IDictionary<TKey, TValue> or ISet. I suspect that in addition to maintaining the IEnumerable implementation, some clients might expect the collection to keep the sequence provided by the List implementation. One could partner the List with a HashSet to help with the Add uniqueness validation, but as the list supports index lookup (which could be solved with a Dictionary) and Delete (which would require something similar to a lazy HashSet/Dictionary rebuild), it becomes slightly more complicated. If the client doesn't depend on the sequence of elements, the type could easily be rewritten on top of a Dictionary, with the Dictionary Values used as the source for IEnumerable. But, at this point, that is a big if.
I don't have any good suggestions for a solution strategy at this point. As InternalRelationshipCollection is internal, one would need to analyze its call sites to better understand what it is really used for, and perhaps one could replace the type altogether. I will try to find some time to come up with a possible solution, and then I guess a pull request would be possible. My concern is that I need a much better understanding of the codebase to see how this can be implemented with a reliable performance profile. I have only used this API indirectly through the Microsoft OpenXML SDK, and I have no clue what other uses it might have. |
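A minimal sketch of the list-plus-dictionary idea described in the comment above, under stated assumptions. The class name OrderedIdCollection, its members, and the ordinal id comparison are all hypothetical and chosen for illustration; this is not the actual InternalRelationshipCollection, nor necessarily the approach taken in any fix. The list keeps insertion order and index access, the dictionary answers the uniqueness check, and a removal only marks the dictionary as stale so that it is rebuilt lazily on the next lookup.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical illustration only; not the real InternalRelationshipCollection.
public sealed class OrderedIdCollection<T> : IEnumerable<T>
{
    private readonly List<T> _items = new List<T>();
    // Maps id -> position in _items. Ordinal comparison is an assumption.
    private readonly Dictionary<string, int> _indexById =
        new Dictionary<string, int>(StringComparer.Ordinal);
    private readonly Func<T, string> _getId;
    private bool _indexStale; // set when a removal has shifted list positions

    public OrderedIdCollection(Func<T, string> getId)
    {
        _getId = getId ?? throw new ArgumentNullException(nameof(getId));
    }

    public int Count => _items.Count;

    public T this[int index] => _items[index];

    // Appends the item; the uniqueness check is a dictionary lookup, not a list scan.
    public void Add(T item)
    {
        EnsureIndex();
        string id = _getId(item);
        if (_indexById.ContainsKey(id))
            throw new ArgumentException("Duplicate id: " + id);
        _indexById.Add(id, _items.Count);
        _items.Add(item);
    }

    // Returns the list position of the id, or -1 if it is not present.
    public int IndexOf(string id)
    {
        EnsureIndex();
        return _indexById.TryGetValue(id, out int index) ? index : -1;
    }

    // Removes the item and defers the dictionary rebuild to the next lookup.
    public bool Remove(string id)
    {
        int index = IndexOf(id);
        if (index < 0) return false;
        _items.RemoveAt(index);
        _indexStale = true;
        return true;
    }

    // Rebuilds the id -> position map only if a removal invalidated it.
    private void EnsureIndex()
    {
        if (!_indexStale) return;
        _indexById.Clear();
        for (int i = 0; i < _items.Count; i++)
            _indexById[_getId(_items[i])] = i;
        _indexStale = false;
    }

    public IEnumerator<T> GetEnumerator() => _items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

With this shape, a run of consecutive Add calls costs one hash lookup each, while a Remove followed by a lookup pays one linear rebuild. Whether that trade-off suits the real call sites would still need the analysis the comment asks for.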
I took a look at this and since the
I have working code that fixes the performance characteristics while preserving the ordering. I'll submit a PR and am happy to get any feedback. |
Here's a benchmark test to repro it:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using System;
using System.IO;
using System.IO.Packaging;

namespace SystemIOBenchmarks
{
    [SimpleJob(RuntimeMoniker.NetCoreApp31)]
    [MemoryDiagnoser]
    public class BenchmarkClass
    {
        [Params(10, 100, 1000, 10000)]
        public int N;

        [Benchmark]
        public Stream AddRelationships()
        {
            using (var ms = new MemoryStream())
            using (var package = Package.Open(ms, FileMode.Create))
            {
                for (int i = 0; i < N; i++)
                {
                    package.CreateRelationship(new Uri("http://localhost"), TargetMode.External, "RelationshipType");
                }
                return ms;
            }
        }
    }
}
|
Use case
The Microsoft OpenXml SDK relies on the System.IO.Packaging API to generate Microsoft Excel files. This allows adding hyperlinks to Excel cells, but with the current resource consumption this is impractical when you have an Excel sheet with about 100,000 records that each have a link to some additional record information, for instance a deep link to an HTTP resource. The SDK uses PackagePart.CreateRelationship inside the System.IO.Packaging stack.
Background
When adding hyperlinks to Excel cells, you have to call AddHyperlinkRelationship on the OpenXmlContainer. This will then follow the stack in System.IO.Packaging:
PackagePart.CreateRelationship -> InternalRelationshipCollection.Add -> InternalRelationshipCollection.ValidateUniqueRelationshipId -> InternalRelationshipCollection.GetRelationshipIndex.
Problem
ValidateUniqueRelationshipId will call GetRelationshipIndex to ensure that the relationship id is unique, and it loops through all existing relationships each time it is invoked. Each Add therefore takes time proportional to the number of relationships already stored, so the total time to add n relationships grows quadratically with n. Backing this identifier store with something similar to a HashSet would greatly improve the lookup time.
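To make the growth concrete, here is a small self-contained sketch that counts the work done by each strategy. It does not call System.IO.Packaging; the class and method names are hypothetical, and the "rId" + i identifiers are only a stand-in for generated relationship ids. The first method mimics a uniqueness check that scans the whole list before every insert; the second uses a HashSet for the same check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; it models the lookup pattern, not the real library code.
public static class UniquenessCheckCost
{
    // Adds n ids, scanning every existing id before each insert.
    // Returns the number of string comparisons performed.
    public static long AddWithLinearScan(int n)
    {
        var ids = new List<string>();
        long comparisons = 0;
        for (int i = 0; i < n; i++)
        {
            string id = "rId" + i;
            foreach (string existing in ids)
            {
                comparisons++;
                if (string.Equals(existing, id, StringComparison.Ordinal))
                    throw new InvalidOperationException("Duplicate id: " + id);
            }
            ids.Add(id);
        }
        return comparisons;
    }

    // Adds n ids, using a HashSet for the uniqueness check.
    // Returns the number of set lookups performed (one per insert).
    public static long AddWithHashSet(int n)
    {
        var ids = new HashSet<string>(StringComparer.Ordinal);
        long lookups = 0;
        for (int i = 0; i < n; i++)
        {
            lookups++;
            if (!ids.Add("rId" + i))
                throw new InvalidOperationException("Duplicate id: rId" + i);
        }
        return lookups;
    }
}
```

The linear scan performs n(n-1)/2 comparisons, which is 499,500 for n = 1,000 and roughly 5 billion for n = 100,000, while the HashSet version performs n lookups.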