-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce more efficient internal representation of a sequence VirtualChars #33834
Conversation
Tagging @sharwell as he mentioned seeing VirtualChar info in traces. |
} | ||
|
||
return result.ToImmutableAndFree(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large source of allocations removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now, when a Token/Trivia want to point at a range of characters, they can do it in an alloc-free manner. They just et a VirtualCharSequence (A struct) which is a sub-span of the original VirtualCharSequence.
var last = list.Count == 0 ? null : list.Last(); | ||
list.Add(ParsePrimaryExpressionAndQuantifiers(last)); | ||
|
||
TryMergeLastTwoNodes(list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O(n^2)
for memory and cpu. If you had a long sequence of text nodes (not at all hard if you just have abcdefgh...
), then each text node would concat with the previous, recopying all of the previous node's contents and adding in the new text.
New merging algorithm is linear.
Note: i gave some thought if we could make it so we can avoid merging the text nodes in the first place and just produce merged text tokens, but it proved challenging.
{ | ||
// Allow trivia after the character class, and whatever is next in the sequence. | ||
closeBracketToken = ConsumeCurrentToken(allowTrivia: true); | ||
break; | ||
} | ||
|
||
ParseCharacterClassComponents(contents); | ||
TryMergeLastTwoNodes(contents); | ||
ParseCharacterClassComponents(builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same issue as above. O(n^2) is now linear in time and memory.
Ok @sharwell this is ready for review now. |
} | ||
|
||
var ch = this.CurrentChar; | ||
Position++; | ||
|
||
return CreateToken(GetKind(ch), trivia, ImmutableArray.Create(ch)); | ||
return CreateToken(GetKind(ch), trivia, Text.GetSubSequence(new TextSpan(Position - 1, 1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allocation removed.
src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs
Show resolved
Hide resolved
src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs
Show resolved
Hide resolved
src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs
Show resolved
Hide resolved
src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with review pass (iteration 27)
@jcouv Thanks for the suggestion on StringChunk. IT really did make it simpler (both in impl, and conceptually!). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks (iteration 28)
@jcouv can you merge in? |
Need an IDE sign-off. Tagging @jinujoseph |
How are you not an honorary IDE member yet? :) |
Perhaps @ryzngard can help out here? |
public int IndexOf(VirtualChar @char) | ||
{ | ||
int index = 0; | ||
foreach (var ch in this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why do a foreach
when you already have an index? for
seems appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach
can be more efficient if the underlying representation changes. For example, if we cahnge this to a tree in the future. However, indexin may end up being O(n log n)
. This style keeps us safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The VirtualCharSequence
was surprisingly complex, but I don't see a better way to do it. Linear parsing is much better 💯
Thanks! |
@CyrusNajmabadi Were you able to think of a standard benchmark scenario that would stress this section of code (something we can reference in the future when we want to make changes)? |
a regex containing something like this: This is not actually unreasonable as someone may be trying match some large textual string with additional regexy things around it. The above is the worst case on all axes. First, we create a large VirtualCharArray for the entire text. Then each character becomes a node in the tree, with an array allocation for each character (N + N allocations). Then, we make that into one contiguous text-node (so that the final tree is small). However, the process of doing that produced O(n2) garbage and would run O(n2). Post this change we will still produce N nodes during parsing, but condense down to 1-2 nodes at the end. Importantly, we never make the N intermediary arrays initially. And we don't make the n2 arrays while condensing. |
Also, just for normal strings (i.e. not regex), this should now just be a lot better. If the string doesn't contain escapes (likely the common case), then we represent it in memory with just a single allocation. So you could have a humongous string (or tons of tiny strings) and we now allocate Where i have not improved things is on non-regex strings with escapes in them. These are effectively as expensive as before. I have some ideas on how to improve that case (i.e. by storing a tree of 'chunks' internally), but it will dramatically increase the complexity. Simpler approaches (like a linear sequence of 'chunks') could improve memory, but might have terrible CPU in pathological cases. |
Review with https://github.com/dotnet/roslyn/pull/33834/files?w=1 for better diffs.
Prior to this PR, the way regex worked was to first start with a string-SyntaxToken. This token was then converted into an ImmutableArray where each char of the string was 'interpretted' into 'VirtualChars' and then placed in the array. For a normal character (like 'a'), this interpretation had no effect. But we did it to be able to understand that
\\
in text would mean\
as an actual char value.After this, this sequence of 'VirtualChars' was then lexed and parsed if appropriate by the regex engine. When lexing, sub-sequences of this array were then placed into the RegexTokens being created. Finally, these tokens are parsed into a tree.
This can end up with a fairly large number of allocations. First, you're gettin an array created for each string. And that array goes from being
sizeof(char) * N
tosizeof(VirtualChar) * N
. Then, each token/trivia will consume and copy part of that array, effectively leading to double that.--
This PR changes things in a couple of ways.
ImmutableArray<VirtualChar>
), but allows for more efficient internal implementations."abc"
. In this case, we just create a single object that points at that text and says "i represents the chars fromabc
". This means that for most strings in a project, we can represent the sequence of chars with just a single allocation.