Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce more efficient internal representation of a sequence VirtualChars #33834

Merged
merged 28 commits into from
Mar 5, 2019

Conversation

CyrusNajmabadi
Copy link
Member

@CyrusNajmabadi CyrusNajmabadi commented Mar 2, 2019

Review with https://github.com/dotnet/roslyn/pull/33834/files?w=1 for better diffs.

Prior to this PR, the way regex worked was to first start with a string-SyntaxToken. This token was then converted into an ImmutableArray where each char of the string was 'interpretted' into 'VirtualChars' and then placed in the array. For a normal character (like 'a'), this interpretation had no effect. But we did it to be able to understand that \\ in text would mean \ as an actual char value.

After this, this sequence of 'VirtualChars' was then lexed and parsed if appropriate by the regex engine. When lexing, sub-sequences of this array were then placed into the RegexTokens being created. Finally, these tokens are parsed into a tree.

This can end up with a fairly large number of allocations. First, you're gettin an array created for each string. And that array goes from being sizeof(char) * N to sizeof(VirtualChar) * N. Then, each token/trivia will consume and copy part of that array, effectively leading to double that.

--

This PR changes things in a couple of ways.

  1. There is now a new abstraction called VirtualCharSequence. It exposes the needed surface area that features need (i.e. it's similar to an ImmutableArray<VirtualChar>), but allows for more efficient internal implementations.
  2. One important implementation is the 'sits on a string' version. this version is used when the original string had no escaped sequences in it. i.e. a string like "abc". In this case, we just create a single object that points at that text and says "i represents the chars from abc". This means that for most strings in a project, we can represent the sequence of chars with just a single allocation.
  3. It's efficient (one extra object) to get a sub-sequence of a VirtualCharSequence. So, when all the trivia/tokens get lexed for a regex, we can just point at the right information without copying out an entire array.

@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner March 2, 2019 22:06
@CyrusNajmabadi
Copy link
Member Author

Tagging @sharwell as he mentioned seeing VirtualChar info in traces.

}

return result.ToImmutableAndFree();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large source of allocations removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now, when a Token/Trivia want to point at a range of characters, they can do it in an alloc-free manner. They just et a VirtualCharSequence (A struct) which is a sub-span of the original VirtualCharSequence.

var last = list.Count == 0 ? null : list.Last();
list.Add(ParsePrimaryExpressionAndQuantifiers(last));

TryMergeLastTwoNodes(list);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O(n^2) for memory and cpu. If you had a long sequence of text nodes (not at all hard if you just have abcdefgh...), then each text node would concat with the previous, recopying all of the previous node's contents and adding in the new text.

New merging algorithm is linear.

Note: i gave some thought if we could make it so we can avoid merging the text nodes in the first place and just produce merged text tokens, but it proved challenging.

{
// Allow trivia after the character class, and whatever is next in the sequence.
closeBracketToken = ConsumeCurrentToken(allowTrivia: true);
break;
}

ParseCharacterClassComponents(contents);
TryMergeLastTwoNodes(contents);
ParseCharacterClassComponents(builder);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue as above. O(n^2) is now linear in time and memory.

@CyrusNajmabadi
Copy link
Member Author

Ok @sharwell this is ready for review now.

}

var ch = this.CurrentChar;
Position++;

return CreateToken(GetKind(ch), trivia, ImmutableArray.Create(ch));
return CreateToken(GetKind(ch), trivia, Text.GetSubSequence(new TextSpan(Position - 1, 1)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocation removed.

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with review pass (iteration 27)

@CyrusNajmabadi
Copy link
Member Author

@jcouv Thanks for the suggestion on StringChunk. IT really did make it simpler (both in impl, and conceptually!).

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks (iteration 28)

@CyrusNajmabadi
Copy link
Member Author

@jcouv can you merge in?

@jcouv
Copy link
Member

jcouv commented Mar 4, 2019

Need an IDE sign-off. Tagging @jinujoseph

@jinujoseph jinujoseph added the Community The pull request was submitted by a contributor who is not a Microsoft employee. label Mar 4, 2019
@CyrusNajmabadi
Copy link
Member Author

Need an IDE sign-off. Tagging @jinujoseph

How are you not an honorary IDE member yet? :)

@CyrusNajmabadi
Copy link
Member Author

Perhaps @ryzngard can help out here?

public int IndexOf(VirtualChar @char)
{
int index = 0;
foreach (var ch in this)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why do a foreach when you already have an index? for seems appropriate here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foreach can be more efficient if the underlying representation changes. For example, if we cahnge this to a tree in the future. However, indexin may end up being O(n log n). This style keeps us safe.

Copy link
Contributor

@ryzngard ryzngard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The VirtualCharSequence was surprisingly complex, but I don't see a better way to do it. Linear parsing is much better 💯

@CyrusNajmabadi
Copy link
Member Author

@ryzngard @jcouv can one of you merge in?

@ryzngard ryzngard merged commit 95a6f7b into dotnet:master Mar 5, 2019
@CyrusNajmabadi
Copy link
Member Author

Thanks!

@CyrusNajmabadi CyrusNajmabadi deleted the virtualChars branch March 5, 2019 06:29
@sharwell
Copy link
Member

sharwell commented Mar 5, 2019

@CyrusNajmabadi Were you able to think of a standard benchmark scenario that would stress this section of code (something we can reference in the future when we want to make changes)?

@CyrusNajmabadi
Copy link
Member Author

a regex containing something like this: now is the time for all good men to come to the aid of their party (going on as long as you want).

This is not actually unreasonable as someone may be trying match some large textual string with additional regexy things around it.

The above is the worst case on all axes. First, we create a large VirtualCharArray for the entire text. Then each character becomes a node in the tree, with an array allocation for each character (N + N allocations). Then, we make that into one contiguous text-node (so that the final tree is small). However, the process of doing that produced O(n2) garbage and would run O(n2).

Post this change we will still produce N nodes during parsing, but condense down to 1-2 nodes at the end. Importantly, we never make the N intermediary arrays initially. And we don't make the n2 arrays while condensing.

@CyrusNajmabadi
Copy link
Member Author

Also, just for normal strings (i.e. not regex), this should now just be a lot better. If the string doesn't contain escapes (likely the common case), then we represent it in memory with just a single allocation. So you could have a humongous string (or tons of tiny strings) and we now allocate O(count of strings) vs O(size of strings). The former obviously being much smaller than the latter.

Where i have not improved things is on non-regex strings with escapes in them. These are effectively as expensive as before.

I have some ideas on how to improve that case (i.e. by storing a tree of 'chunks' internally), but it will dramatically increase the complexity. Simpler approaches (like a linear sequence of 'chunks') could improve memory, but might have terrible CPU in pathological cases.

@sharwell sharwell mentioned this pull request Mar 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants