Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.IO.Packaging.InternalRelationshipCollection.GetRelationshipIndex doesn't scale #983

Closed
mabrahamsen opened this issue Dec 17, 2019 · 5 comments · Fixed by #35978
Closed

Comments

@mabrahamsen
Copy link

mabrahamsen commented Dec 17, 2019

Use case
The Microsoft OpenXml SDK relies on the System.IO.Packaging API to generate Microsoft Excel files. This allows adding hyperlinks to Excel columns, but with the current resource consumption this is impossible when you have an excel sheet with about 100.000 records that each has a link to some additional record information - for instance a deep link to a HTTP resource. This API uses PackagePart.CreateRelationship inside the System.IO.Packaging stack.

Background
When adding hyperlinks to Excel cells, you have to call AddHyperlinkRelationship on the OpenXmlContainer. This will then follow the stack in System.IO.Packaging:
PackagePart.CreateRelationship -> InternalRelationshipCollection.Add -> InternalRelationshipCollection.ValidateUniqueRelationshipId -> InternalRelationshipCollection.GetRelationshipIndex.

Problem
ValidateUniqueRelationshipId will call GetRelationshipIndex to ensure that the relationship id is unique, and it loops through all the links each time it is invoked. As the identifier list grows it will (obviously) become exponentially slower. Backing this identifier store by something similar to a HashSet would greatly improve the lookup time.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Packaging untriaged New issue has not been triaged by the area owner labels Dec 17, 2019
@danmoseley
Copy link
Member

@carlossanlop would we accept a PR here?

@carlossanlop
Copy link
Member

carlossanlop commented Jan 13, 2020

Thank you for bringing this to our attention, @danmosemsft.

Here are my notes about this request:

  • Anyone consuming InternalRelationshipCollection may depend on the list-like behavior of the IEnumerable interface, so we need to make sure we don't affect the enumerable behavior.
  • HashSets typically inherit from ISet and/or IReadOnlySet (ISet would be the appropriate interface for this particular case, since it needs to be modified). The InternalRelationshipCollection class implements IEnumerable<PackageRelationship>. Changing the interface to ISet would mean implementing all the APIs of this new interface, and be careful to not affect the ones from IEnumerable.
  • We have a class that implements both IEnumerable and ISet: System.Collections.Generic.SortedSet, which may be able to solve this issue, but we would still need an investigation on any unexpected consequences of adding the extra functionality from this class. Keep in mind that SortedSet implements many interfaces which would have to be implemented: ICollection<T>, IEnumerable<T>, IReadOnlyCollection<T>, ISet<T>, ICollection, IEnumerable, IDeserializationCallback ISerializable.
  • I think a better option would be to use a dictionary instead of a set.
  • The name of the InternalRelationshipCollection could probably be changed to something more appropriate, since it would not be just a collection anymore. Optional.
  • Seems that the WindowsBase project is duplicating or depending on the code of System.IO.Packaging, and putting inside its own namespace, MS.Internal.IO.Packaging (see source.dot.net). We need to understand the relationship of these two. If there's a dependency, then we would probably be affecting an even larger set of APIs that depend on InternalRelationshipCollection. Update: I can't find that assembly anymore, it may have been removed.

@mabrahamsen, a couple of requests for you:

  • Would you be interested in addressing the above questions, and if the proposed fix seems feasible, work on a PR?
  • Can you please share a code snippet to test your use case?

@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Jan 13, 2020
@carlossanlop carlossanlop added this to the Future milestone Jan 13, 2020
@mabrahamsen
Copy link
Author

mabrahamsen commented Jan 24, 2020

Hi @carlossanlop,

I have extracted a simple sample and attached hit here. This will generate an Excel sheet using the Microsoft OpenXML SDK, and I have code with or without the hyperlink. You will see how the hyperlink version doesn't scale if you change the rowcounts. With the current rowcount a profiler will show you that about 98% of the time is used calling InternalRelationshipCollection.Add, to add a hyperlink per row.

WorksheetWriter.zip

Also, when I mentioned the HashSet, it was more to point towards a collection lookup strategy to avoid doing all the loops over the internal list. Not that I recommended changing the type itself to inherit from IDictionary<TKey, TValue> or ISet. I suspect that in additional to maintaining the IEumerable implementation, some clients might expect the collection to keep the sequence provided by the List implementation. One could partner the List with a HashSet to help with the Add uniqueness validation, but as the list supports index lookup (which could be solved with Dictionary) and Delete (which would require something similar to a Lazy HashSet/Dictionary rebuild) it becomes slightly more complicated. If the client doesn’t depend on the sequence of elements, the type could easily be rewritten on top of a Dictionary, and the Dictionary Values used as the source for IEnumerable. But, at this point, that is a big if.

I don’t have any good suggestions for a solution strategy at this point. As InternalRelationshipCollection is internal one would need to analyze its call-sites to better understand what it is really used for, and perhaps one could replace the type altogether.

I will try and get some time to come up with a possible solution, and then I guess a pull request would be possible. My concern is that I need a much better understanding of the codebase to see how this can be implemented with a reliable performance profile. I have only used this API indirectly through the Microsoft OpenXML SDK, and I have no clue what other uses it might have.

@twsouthwick
Copy link
Member

twsouthwick commented May 7, 2020

I took a look at this and since the InternalRelationshipCollection is internal, the only public aspect of it is that is passed as a IEnumerable<PackageRelationship>. Thus, the following behavior is needed:

  • Needs to maintain the same order as the items added to it. It does not appear to be required, but this is a behavior that it currently has and I'm assuming that shouldn't be changed
  • Needs to be able to add/remove/find in reasonable time. This is only used internally

I have working code that fixes the performance characteristics as well as maintains the ordering issue. I'll submit a PR and happy for any feedback.

@twsouthwick
Copy link
Member

twsouthwick commented May 7, 2020

Here's a benchmark test to repro it:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using System;
using System.IO;
using System.IO.Packaging;

namespace SystemIOBenchmarks
{
    [SimpleJob(RuntimeMoniker.NetCoreApp31)]
    [MemoryDiagnoser]
    public class BenchmarkClass
    {
        [Params(10, 100, 1000, 10000)]
        public int N;

        [Benchmark]
        public Stream AddRelationships()
        {
            using (var ms = new MemoryStream())
            using (var package = Package.Open(ms, FileMode.Create))
            {
                for (int i = 0; i < N; i++)
                {
                    package.CreateRelationship(new Uri("http://localhost"), TargetMode.External, "RelationshipType");
                }

                return ms;
            }
        }
    }
}

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants