-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize string building and interning #5663
Conversation
Checks failed due to #5506 |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
We have added several dependent .dll's recently so I was under the assumption that it is not a super disruptive change.
In case of this library it is unfortunately not possible to statically link it into MSBuild as sharing it among all components in the process is an explicit goal. I am curious to learn more about the implications of adding a dependency. |
As far as I'm aware:
|
@danmosemsft may I ask you to take a look at the StringTools library introduced in this PR? This is the library to be published and used by multiple components running in the devenv process that we recently agreed would initially live in the msbuild repo. Thank you! |
@ladipro thanks for sharing. I will try to look but maybe not today. It sounds like the idea is that this would be a generic interning service for any managed code in VS? As you know there are countless strategies one could use, from the PR it's only possible to review for correctness and not be confident about the approach. I would expect you plan to measure (in MSBuild and VS) and iterate up the wazoo. Including with non MSBuild consumers. cc @jkotas in case he is interested in your strategies. Glad you are tackling this! |
This is the kind of thing @AArnott may be interested in. |
Yes. Still debating whether it needs to be a formal VS "service" but it's going to be at least a regular library. In the MSBuild (and I'm guessing also Roslyn) case, the same code runs out-of-proc as well as in-proc so it's convenient to simply link against a library and then automatically get process-wide string sharing, be it the devenv process or an out-of-proc worker.
That's the idea. I'm starting with MSBuild and will then consume the library from other places as well. Nothing is set in stone and there are likely going to be adjustments, although the core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only about halfway through my review. Still need to look at tests and most of the places you used it.
Sorry if there are duplicates of others' comments; I started this a while ago and haven't always been keeping up since.
MSBuild.Dev.sln
Outdated
@@ -1,7 +1,7 @@ | |||
| |||
Microsoft Visual Studio Solution File, Format Version 12.00 | |||
# Visual Studio 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be reasonable to finally replace this with a .slnf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works! Some questions and a couple thoughts on potential points of perf improvement, but this looks good!
src/Build/Evaluation/Expander.cs
Outdated
} | ||
} | ||
|
||
if (argumentStartIndex.HasValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do it this way instead of appending in the else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code was appending to the argumentBuilder
character by character. With the new data structure this would be inefficient so now it keeps track of the start of the current argument and appends the entire substring when it finds its end. This if
statement is handling the case where the argument's last character is the last character of the input string so we don't detect the end inside the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a very similar question. Comment this?
src/StringTools.UnitTests/SpanBasedStringBuilder.Simple_Tests.cs
Outdated
Show resolved
Hide resolved
I have finally updated the PR and made it ready for final review. Thank you for bearing with me and for the great feedback so far. Notable changes since the last round:
I've added a simple microbenchmark comparing
Composing the result out of 1 string with 64 characters is about a wash between the two, both on cache hit (string already interned) and cache miss (string seen for the first time). In all cases, though, @rainersigwald, may I ask you to take a look. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed Rainer's comments and rebased on top of current master
. If all checks pass, I'll squash the fixup commits into the main work and it will be ready for merge.
40428b6
to
a58570c
Compare
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
a58570c
to
580e481
Compare
Visual Studio insertion with this change will require manual tweaks. |
Optimize string building and interning Introduces a new assembly Microsoft.NET.StringTools with string interning functionality and a span-based string builder. Evaluation performance, in terms of memory allocations and time, is in single-digit percentage. Detailed description in dotnet#5663
Overview
This PR addresses the following issues:
IInternable
interface as its input. Since most implementations of the interface are structs, the code is duplicated at run-time using generics to prevent boxing:void SomeFunction<T>(T internable) where T : IInternable
.IInternable
is functionally equivalent toReadOnlySpan<char>
.IInternable
needs a separate and manually authored implementation for each kind of input, whereasReadOnlySpan<char>
comes from BCL for free. We are currently missing a wrapper overchar *, length
, for example.IInternable
is ourStringBuilder
wrapper. Interning iterates over the candidate string linearly, hashing its contents and comparing it with already interned strings. ForStringBuilder
this operation is O(N^2) because of its internal representation and the lack of an efficient enumerator.StringBuilder
in MSBuild are "append-only" where the resulting string is composed by concatenating a series of string fragments representable withReadOnlySpan<char>
orReadOnlyMemory<char>
. This usage pattern allocates 2x the length of the resulting string - one copy lives in theStringBuilder
's internal char array(s) and the other is created as part of theToString()
call.Introduced in this change is a new type named
SpanBasedStringBuilder
, serving as an efficientStringBuilder
-like builder. Also provided is a static class namedStrings
with functionality to intern an arbitraryReadOnlySpan<char>
.a) Interning read-only character spans
Makes it possible to intern a string, substring, character array/sub-array, char *, or any other
ReadOnlySpan<char>
. The advantage over first creating the desiredSystem.String
and then interning it is that it does away with the ephemeralSystem.String
allocation in the case where the string already is in the intern table.Memory characteristics of the above snippet:
b) An internable
StringBuilder
.Compensates for the fact that
System.Text.StringBuilder
cannot be represented with oneReadOnlySpan<char>
and that it does not provide a linear-time character enumeration. The usage pattern is similar to that of aStringBuilder
but it does not allocate O(N) bytes where N is the intermediate string length, but rather allocates only spans describing its constituent pieces. Note that in degenerated cases this may be worse than O(N) so make sure it's used only where it's actually helping.Memory characteristics of the above snippet:
SpanBasedStringBuilder
instances are pooled so in the expected case the GetSpanBasedStringBuilder() call does not allocate,SpanBasedStringBuilder
holds a List of span descriptors on the GC heap, the size is O(S) where S is the number of spans (in this example 3),Append()
calls do not allocate memory unless they need to resize the List of span descriptors,ToString()
call does not allocate the string "semicolon;delimited" if it already exists in the intern table.Code structure
ReuseableStringBuilder
is simplified as it no longer supports interning (going forward most of its uses should be replaced withSpanBasedStringBuilder
),ReusableStringBuilder
that used interning are converted toSpanBasedStringBuilder
,Strings.WeakIntern()
,IInternable
,WeakStringCache
,OpportunisticIntern
are all deleted and the functionality moved to a new library calledStringTools
,StringTools
is built for three targets: .NET Standard, .NET Framework 4, and .NET Framework 3.5 to be used in MSBuildTaskHost, and as usual, the 3.5 implementation is rather simplified as we're fine with sub-optimal performance,Note that no effort has been made to convert other uses of
ReuseableStringBuilder
andStringBuilder
toSpanBasedStringBuilder
. The expectation is that there are many opportunities to eliminate ephemeral allocations and it should be done in separate PRs.To-do