Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize string building and interning #5663

Merged
merged 14 commits into from
Jan 21, 2021

Conversation

ladipro
Copy link
Member

@ladipro ladipro commented Aug 19, 2020

Overview

This PR addresses the following issues:

  1. The current version of string interning takes a custom IInternable interface as its input. Since most implementations of the interface are structs, the code is duplicated at run-time using generics to prevent boxing: void SomeFunction<T>(T internable) where T : IInternable.
  2. It's been pointed out that IInternable is functionally equivalent to ReadOnlySpan<char>. IInternable needs a separate and manually authored implementation for each kind of input, whereas ReadOnlySpan<char> comes from BCL for free. We are currently missing a wrapper over char *, length, for example.
  3. One of the string-like datatypes implementing IInternable is our StringBuilder wrapper. Interning iterates over the candidate string linearly, hashing its contents and comparing it with already interned strings. For StringBuilder this operation is O(N^2) because of its internal representation and the lack of an efficient enumerator.
  4. Most uses of StringBuilder in MSBuild are "append-only" where the resulting string is composed by concatenating a series of string fragments representable with ReadOnlySpan<char> or ReadOnlyMemory<char>. This usage pattern allocates 2x the length of the resulting string - one copy lives in the StringBuilder's internal char array(s) and the other is created as part of the ToString() call.

Introduced in this change is a new type named SpanBasedStringBuilder, serving as an efficient StringBuilder-like builder. Also provided is a static class named Strings with functionality to intern an arbitrary ReadOnlySpan<char>.

a) Interning read-only character spans
Makes it possible to intern a string, substring, character array/sub-array, char *, or any other ReadOnlySpan<char>. The advantage over first creating the desired System.String and then interning it is that it does away with the ephemeral System.String allocation in the case where the string already is in the intern table.

var str = Strings.WeakIntern("semicolon;delimited".AsSpan(0, 8));
Console.WriteLine(str); // prints "semicolon"

Memory characteristics of the above snippet:

  • The WeakIntern() call does not allocate if the string "semicolon" already exists in the intern table,
  • The call allocates the string and a small overhead if the string is not yet in the intern table.

b) An internable StringBuilder.
Compensates for the fact that System.Text.StringBuilder cannot be represented with one ReadOnlySpan<char> and that it does not provide a linear-time character enumeration. The usage pattern is similar to that of a StringBuilder but it does not allocate O(N) bytes where N is the intermediate string length, but rather allocates only spans describing its constituent pieces. Note that in degenerated cases this may be worse than O(N) so make sure it's used only where it's actually helping.

using var str = Strings.GetSpanBasedStringBuilder();
str.Append("semicolon");
str.Append(";");
str.Append("delimited");
Console.WriteLine(str.ToString()); // prints "semicolon;delimited"

Memory characteristics of the above snippet:

  • SpanBasedStringBuilder instances are pooled so in the expected case the GetSpanBasedStringBuilder() call does not allocate,
  • SpanBasedStringBuilder holds a List of span descriptors on the GC heap, the size is O(S) where S is the number of spans (in this example 3),
  • The Append() calls do not allocate memory unless they need to resize the List of span descriptors,
  • The ToString() call does not allocate the string "semicolon;delimited" if it already exists in the intern table.

Code structure

  • ReuseableStringBuilder is simplified as it no longer supports interning (going forward most of its uses should be replaced with SpanBasedStringBuilder),
  • All uses of ReusableStringBuilder that used interning are converted to SpanBasedStringBuilder,
  • All string/substring/char array interning is converted to Strings.WeakIntern(),
  • IInternable, WeakStringCache, OpportunisticIntern are all deleted and the functionality moved to a new library called StringTools,
  • StringTools is built for three targets: .NET Standard, .NET Framework 4, and .NET Framework 3.5 to be used in MSBuildTaskHost, and as usual, the 3.5 implementation is rather simplified as we're fine with sub-optimal performance,

Note that no effort has been made to convert other uses of ReuseableStringBuilder and StringBuilder to SpanBasedStringBuilder. The expectation is that there are many opportunities to eliminate ephemeral allocations and it should be done in separate PRs.

To-do

  • Run an experimental VS insertion
  • Measure performance impact by profiling project evaluation

@ladipro ladipro added the WIP label Aug 19, 2020
@ladipro ladipro mentioned this pull request Aug 26, 2020
@ladipro ladipro removed the WIP label Aug 28, 2020
@ladipro ladipro marked this pull request as ready for review August 28, 2020 09:38
@ladipro
Copy link
Member Author

ladipro commented Aug 28, 2020

Checks failed due to #5506

@ladipro
Copy link
Member Author

ladipro commented Aug 28, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@cdmihai
Copy link
Contributor

cdmihai commented Sep 10, 2020

Adding a new MSBuild assembly will cause a bit of propagation pain to all the tools that have hard coded lists with all the msbuild assemblies, like Microsoft.Build.Locator.


Refers to: src/StringTools/StringTools.csproj:1 in e1846cc. [](commit_id = e1846cc, deletion_comment = False)

@ladipro
Copy link
Member Author

ladipro commented Sep 10, 2020

Adding a new MSBuild assembly will cause a bit of propagation pain to all the tools that have hard coded lists with all the msbuild assemblies, like Microsoft.Build.Locator.

We have added several dependent .dll's recently so I was under the assumption that it is not a super disruptive change.

  • System.Text.Json.dll for solution filter support,
  • StreamJsonRpc.dll & friends for modern IPC serialization.

In case of this library it is unfortunately not possible to statically link it into MSBuild as sharing it among all components in the process is an explicit goal. I am curious to learn more about the implications of adding a dependency.

@cdmihai
Copy link
Contributor

cdmihai commented Sep 10, 2020

I am curious to learn more about the implications of adding a dependency.

As far as I'm aware:

  • code that hard codes msbuild assemblies will potentially need updating (like here or here)
  • app.config probably needs a new redirect
  • update all the packages we ship

@ladipro
Copy link
Member Author

ladipro commented Sep 10, 2020

@danmosemsft may I ask you to take a look at the StringTools library introduced in this PR? This is the library to be published and used by multiple components running in the devenv process that we recently agreed would initially live in the msbuild repo. Thank you!

@danmoseley
Copy link
Member

@ladipro thanks for sharing. I will try to look but maybe not today.

It sounds like the idea is that this would be a generic interning service for any managed code in VS?

As you know there are countless strategies one could use, from the PR it's only possible to review for correctness and not be confident about the approach. I would expect you plan to measure (in MSBuild and VS) and iterate up the wazoo. Including with non MSBuild consumers.

cc @jkotas in case he is interested in your strategies. Glad you are tackling this!

@danmoseley
Copy link
Member

This is the kind of thing @AArnott may be interested in.

@ladipro
Copy link
Member Author

ladipro commented Sep 14, 2020

It sounds like the idea is that this would be a generic interning service for any managed code in VS?

Yes. Still debating whether it needs to be a formal VS "service" but it's going to be at least a regular library. In the MSBuild (and I'm guessing also Roslyn) case, the same code runs out-of-proc as well as in-proc so it's convenient to simply link against a library and then automatically get process-wide string sharing, be it the devenv process or an out-of-proc worker.

As you know there are countless strategies one could use, from the PR it's only possible to review for correctness and not be confident about the approach. I would expect you plan to measure (in MSBuild and VS) and iterate up the wazoo. Including with non MSBuild consumers.

That's the idea. I'm starting with MSBuild and will then consume the library from other places as well. Nothing is set in stone and there are likely going to be adjustments, although the core ReadOnlySpan<char> -> string interning function is likely going to stay the way it is - it is super simple and should cover most scenarios.

Copy link
Member

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only about halfway through my review. Still need to look at tests and most of the places you used it.

Sorry if there are duplicates of others' comments; I started this a while ago and haven't always been keeping up since.

MSBuild.Dev.sln Outdated
@@ -1,7 +1,7 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 15
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be reasonable to finally replace this with a .slnf.

MSBuild.Dev.sln Outdated Show resolved Hide resolved
src/Shared/StringToolsHost.cs Outdated Show resolved Hide resolved
src/Shared/StringToolsHost.cs Outdated Show resolved Hide resolved
src/StringTools/StringTools.cs Outdated Show resolved Hide resolved
src/StringTools/StringTools.cs Outdated Show resolved Hide resolved
src/StringTools/StringTools.cs Outdated Show resolved Hide resolved
src/StringTools/StringTools.cs Outdated Show resolved Hide resolved
src/StringTools/StringTools.cs Outdated Show resolved Hide resolved
Copy link
Member

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works! Some questions and a couple thoughts on potential points of perf improvement, but this looks good!

src/Build/Evaluation/Expander.cs Outdated Show resolved Hide resolved
}
}

if (argumentStartIndex.HasValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do it this way instead of appending in the else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code was appending to the argumentBuilder character by character. With the new data structure this would be inefficient so now it keeps track of the start of the current argument and appends the entire substring when it finds its end. This if statement is handling the case where the argument's last character is the last character of the input string so we don't detect the end inside the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a very similar question. Comment this?

src/Build/Evaluation/Expander.cs Show resolved Hide resolved
src/Directory.Build.targets Show resolved Hide resolved
src/Shared/PropertyParser.cs Show resolved Hide resolved
src/StringTools.UnitTests/OpportunisticIntern_Tests.cs Outdated Show resolved Hide resolved
src/StringTools.UnitTests/OpportunisticIntern_Tests.cs Outdated Show resolved Hide resolved
src/StringTools.UnitTests/StringTools_Tests.cs Outdated Show resolved Hide resolved
@ladipro
Copy link
Member Author

ladipro commented Dec 16, 2020

I have finally updated the PR and made it ready for final review. Thank you for bearing with me and for the great feedback so far. Notable changes since the last round:

  1. Hardcoded strings are gone. I took measurements without hardcoded strings, with hardcoded strings as they currently exist in the codebase, and with hardcoded strings tailored for a specific scenario (i.e. hardcoding top N strings seen when building a project).
    -- The strings currently in the main branch are not helping. Total time spent interning is actually ~1% worse.
    -- Hardcoding the right strings can make things better (by about 1%, interestingly) but the set of right strings varies across projects being built. Strings that make one project 1% faster may make little effect on another project or even make it slower. Btw, the most frequently interned strings are not "true", "false", "==" but rather things like "$(MSBuildProjectDirectory)", "@(ReferencePath->'%(ReferenceAssembly)')", and "$([MSBuild]::EnsureTrailingSlash($(MSBuildProjectDirectory)))".

    All in all, 1% of string interning cost (not build cost!) is definitely not worth the added complexity and maintenance cost.

  2. SpanBasedStringBuilder pooling is now done trivially with a thread-static variable rather than with ObjectPool. MSBuild does not have code where a given thread would need to hold on to multiple builders at the same time. It's always: 1. Get a builder, 2. Append to it, 3. ToString() it. 4. Dispose of it. This makes pooling faster and eliminates the dependency on Microsoft.Extensions.ObjectPool.

I've added a simple microbenchmark comparing

  • Appending to a regular StringBuilder and interning the result of ToString().
  • Appending to SpanBasedStringBuilder and getting the resulting string with ToString() (which has interning built in).
    As expected, the results vary wildly depending on how exactly the appending is done - how many strings of what length.

Composing the result out of 1 string with 64 characters is about a wash between the two, both on cache hit (string already interned) and cache miss (string seen for the first time).
In the degenerate case composing the result out of 1 string 1 character long, the regular SB is faster by about 40% on hit and 20% on miss (~200 ns difference in both cases).
On the other end of the spectrum, composing the result out of 16 strings 512 characters each, SpanBasedStringBuilder is faster by 15% on hit and by 10% on miss.

In all cases, though, SpanBasedStringBuilder is zero-alloc on hit whereas the regular StringBuilder allocates the ephemeral string in ToString. I'll run a macro-benchmark to quantifying the impact on project evaluation.

@rainersigwald, may I ask you to take a look. Thank you!

Copy link
Member Author

@ladipro ladipro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed Rainer's comments and rebased on top of current master. If all checks pass, I'll squash the fixup commits into the main work and it will be ready for merge.

@ladipro ladipro force-pushed the switch-interner-to-span branch from 40428b6 to a58570c Compare January 20, 2021 08:40
@ladipro
Copy link
Member Author

ladipro commented Jan 20, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ladipro ladipro force-pushed the switch-interner-to-span branch from a58570c to 580e481 Compare January 20, 2021 21:34
@ladipro ladipro merged commit 7c40815 into dotnet:master Jan 21, 2021
@ladipro ladipro deleted the switch-interner-to-span branch January 21, 2021 08:09
@ladipro
Copy link
Member Author

ladipro commented Jan 21, 2021

Visual Studio insertion with this change will require manual tweaks.

benvillalobos pushed a commit to benvillalobos/msbuild that referenced this pull request Feb 1, 2021
Optimize string building and interning

Introduces a new assembly Microsoft.NET.StringTools with string interning functionality and a span-based
string builder. Evaluation performance, in terms of memory allocations and time, is in single-digit percentage.

Detailed description in dotnet#5663
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eliminate IInternable in favor of ReadOnlySpan<char> for OpportunisticIntern inputs
6 participants