Introduce more efficient internal representation of a sequence of VirtualChars #33834

CyrusNajmabadi · 2019-03-02T22:06:08Z

Review with https://github.com/dotnet/roslyn/pull/33834/files?w=1 for better diffs.

Prior to this PR, the regex feature started with a string SyntaxToken. This token was then converted into an ImmutableArray, where each char of the string was 'interpreted' into a 'VirtualChar' and placed in the array. For a normal character (like 'a'), this interpretation had no effect, but we did it so we could understand that \\ in text means \ as an actual char value.

This sequence of 'VirtualChars' was then lexed and, if appropriate, parsed by the regex engine. When lexing, sub-sequences of this array were placed into the RegexTokens being created. Finally, these tokens were parsed into a tree.

This can end up with a fairly large number of allocations. First, you're getting an array created for each string, and that array goes from being sizeof(char) * N to sizeof(VirtualChar) * N. Then, each token/trivia will consume and copy part of that array, effectively doubling that.
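As a rough illustration of the eager conversion described above, here is a minimal sketch. `SketchVirtualChar` and `EagerConversionSketch` are hypothetical names, not the Roslyn types, and the only escape handled is `\\`. The extra source-position fields are an assumption meant to show why an element can be larger than a plain `char`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a virtual char: the interpreted
// character plus the range of source text it was produced from.
public readonly struct SketchVirtualChar
{
    public readonly char Value;
    public readonly int SourceStart;
    public readonly int SourceLength;

    public SketchVirtualChar(char value, int sourceStart, int sourceLength)
    {
        Value = value;
        SourceStart = sourceStart;
        SourceLength = sourceLength;
    }
}

public static class EagerConversionSketch
{
    // Eagerly interprets every char of the token text into a new array.
    // Only the two-character escape \\ is recognized in this sketch;
    // every other char maps to itself.
    public static SketchVirtualChar[] Convert(string tokenText)
    {
        var result = new List<SketchVirtualChar>(tokenText.Length);
        for (var i = 0; i < tokenText.Length; i++)
        {
            var isEscapedBackslash =
                tokenText[i] == '\\' &&
                i + 1 < tokenText.Length &&
                tokenText[i + 1] == '\\';

            if (isEscapedBackslash)
            {
                // Two source chars collapse into one virtual char.
                result.Add(new SketchVirtualChar('\\', i, 2));
                i++;
            }
            else
            {
                result.Add(new SketchVirtualChar(tokenText[i], i, 1));
            }
        }

        // One array per string, regardless of whether any escape was present.
        return result.ToArray();
    }
}
```

Even for text with no escapes at all, this style allocates a full array whose elements are wider than the original chars, which is the cost the PR sets out to avoid.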

--

This PR changes things in a couple of ways.

There is now a new abstraction called VirtualCharSequence. It exposes the surface area that features need (i.e. it's similar to an ImmutableArray<VirtualChar>), but allows for more efficient internal implementations.
One important implementation is the 'sits on a string' version. This version is used when the original string had no escape sequences in it, i.e. a string like "abc". In this case, we just create a single object that points at that text and says "I represent the chars from abc". This means that for most strings in a project, we can represent the sequence of chars with just a single allocation.
It's efficient (one extra object) to get a sub-sequence of a VirtualCharSequence. So, when all the trivia/tokens get lexed for a regex, we can just point at the right information without copying out an entire array.
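The 'sits on a string' idea can be sketched roughly as follows. `StringBackedSequenceSketch` is a hypothetical type written for this illustration; the actual VirtualCharSequence also has to cover strings with escapes, which this sketch ignores entirely.

```csharp
using System;

// Hypothetical sketch: a view over a range of an existing string.
// Taking a sub-sequence only adjusts offsets; no characters are copied.
public readonly struct StringBackedSequenceSketch
{
    private readonly string _text;
    private readonly int _start;

    public int Length { get; }

    public StringBackedSequenceSketch(string text)
        : this(text, 0, text.Length)
    {
    }

    private StringBackedSequenceSketch(string text, int start, int length)
    {
        _text = text;
        _start = start;
        Length = length;
    }

    // Indexing reads straight from the underlying string.
    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _text[_start + index];
        }
    }

    // Returns a narrower view that shares the same underlying string.
    public StringBackedSequenceSketch GetSubSequence(int start, int length)
    {
        if (start < 0 || length < 0 || start > Length - length)
            throw new ArgumentOutOfRangeException();
        return new StringBackedSequenceSketch(_text, _start + start, length);
    }

    // Materializes the view; only needed when an actual string is required.
    public override string ToString() => _text.Substring(_start, Length);
}
```

With this shape, a lexer could hand each token a view such as `sequence.GetSubSequence(2, 3)` instead of a freshly copied array, and nested sub-sequences keep pointing at the one original string.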

CyrusNajmabadi · 2019-03-02T22:06:34Z

Tagging @sharwell as he mentioned seeing VirtualChar info in traces.

CyrusNajmabadi · 2019-03-03T20:03:32Z

src/Workspaces/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexLexer.cs

-            }
-
-            return result.ToImmutableAndFree();
-        }


Large source of allocations removed.

Now, when a Token/Trivia wants to point at a range of characters, it can do so in an alloc-free manner. It just gets a VirtualCharSequence (a struct) which is a sub-span of the original VirtualCharSequence.

CyrusNajmabadi · 2019-03-03T20:05:46Z

src/Workspaces/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexParser.cs

-                var last = list.Count == 0 ? null : list.Last();
-                list.Add(ParsePrimaryExpressionAndQuantifiers(last));
-
-                TryMergeLastTwoNodes(list);


This was O(n^2) in memory and CPU. If you had a long sequence of text nodes (not at all hard if you just have abcdefgh...), then each text node would concat with the previous one, recopying all of the previous node's contents and adding in the new text.

New merging algorithm is linear.

Note: I gave some thought to whether we could avoid merging the text nodes in the first place and just produce merged text tokens, but it proved challenging.
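To make the quadratic-versus-linear difference concrete, here is a minimal sketch of a single-pass merge of adjacent text nodes. The thread does not show the PR's actual merge code, so `SketchNode`, `LinearMergeSketch`, and the use of a `StringBuilder` are all assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical node type: either a run of text or some other regex construct.
public sealed class SketchNode
{
    public bool IsText { get; }
    public string Content { get; }

    public SketchNode(bool isText, string content)
    {
        IsText = isText;
        Content = content;
    }
}

public static class LinearMergeSketch
{
    // Collapses each run of adjacent text nodes into one text node.
    // Every input character is appended to the pending buffer once and
    // copied out once, so the work is linear in the total text length.
    public static List<SketchNode> MergeTextRuns(IReadOnlyList<SketchNode> nodes)
    {
        var result = new List<SketchNode>();
        var pending = new StringBuilder();

        foreach (var node in nodes)
        {
            if (node.IsText)
            {
                // Defer node creation until the run ends.
                pending.Append(node.Content);
                continue;
            }

            Flush(pending, result);
            result.Add(node);
        }

        Flush(pending, result);
        return result;
    }

    // Emits the buffered run (if any) as a single text node.
    private static void Flush(StringBuilder pending, List<SketchNode> result)
    {
        if (pending.Length == 0)
            return;

        result.Add(new SketchNode(true, pending.ToString()));
        pending.Clear();
    }
}
```

By contrast, merging the last two nodes after every parse step rebuilds the growing text each time, so a run of n single-character nodes copies roughly 1 + 2 + ... + n characters.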

CyrusNajmabadi · 2019-03-03T20:06:37Z

src/Workspaces/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexParser.cs

                {
                    // Allow trivia after the character class, and whatever is next in the sequence.
                    closeBracketToken = ConsumeCurrentToken(allowTrivia: true);
                    break;
                }

-                ParseCharacterClassComponents(contents);
-                TryMergeLastTwoNodes(contents);
+                ParseCharacterClassComponents(builder);


Same issue as above. What was O(n^2) is now linear in time and memory.

CyrusNajmabadi · 2019-03-03T20:18:13Z

Ok @sharwell this is ready for review now.

CyrusNajmabadi · 2019-03-03T20:20:30Z

src/Workspaces/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexLexer.cs

            }

            var ch = this.CurrentChar;
            Position++;

-            return CreateToken(GetKind(ch), trivia, ImmutableArray.Create(ch));
+            return CreateToken(GetKind(ch), trivia, Text.GetSubSequence(new TextSpan(Position - 1, 1)));


Allocation removed.

src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs

jcouv

Done with review pass (iteration 27)

CyrusNajmabadi · 2019-03-04T22:04:58Z

@jcouv Thanks for the suggestion on StringChunk. It really did make it simpler (both in impl, and conceptually!).

jcouv

LGTM Thanks (iteration 28)

CyrusNajmabadi · 2019-03-04T23:15:04Z

@jcouv can you merge in?

jcouv · 2019-03-04T23:33:42Z

Need an IDE sign-off. Tagging @jinujoseph

CyrusNajmabadi · 2019-03-04T23:43:00Z

Need an IDE sign-off. Tagging @jinujoseph

How are you not an honorary IDE member yet? :)

CyrusNajmabadi · 2019-03-04T23:43:10Z

Perhaps @ryzngard can help out here?

ryzngard · 2019-03-05T00:34:52Z

src/Workspaces/Core/Portable/EmbeddedLanguages/VirtualChars/VirtualCharSequence.cs

+        public int IndexOf(VirtualChar @char)
+        {
+            int index = 0;
+            foreach (var ch in this)


nit: why do a foreach when you already have an index? for seems appropriate here

foreach can be more efficient if the underlying representation changes, for example if we change this to a tree in the future. In that case, indexing may end up being O(n log n) over the whole loop. This style keeps us safe.
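A minimal sketch of the enumeration-based style being discussed, written against `IEnumerable<T>` rather than the PR's type (`IndexOfSketch` is a hypothetical helper, not code from the PR):

```csharp
using System.Collections.Generic;

public static class IndexOfSketch
{
    // Walks the sequence once with its enumerator and counts positions,
    // so the cost stays linear even when random access by index is not O(1)
    // (for example, a linked list or a tree-backed sequence).
    public static int IndexOf<T>(IEnumerable<T> sequence, T value)
    {
        var comparer = EqualityComparer<T>.Default;
        var index = 0;

        foreach (var item in sequence)
        {
            if (comparer.Equals(item, value))
                return index;

            index++;
        }

        // Not found.
        return -1;
    }
}
```

The same method works unchanged on a `string`, an array, or a `LinkedList<T>`, which is the robustness to representation changes that the reply argues for.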

ryzngard

LGTM. The VirtualCharSequence was surprisingly complex, but I don't see a better way to do it. Linear parsing is much better 💯

CyrusNajmabadi · 2019-03-05T06:11:02Z

@ryzngard @jcouv can one of you merge in?

CyrusNajmabadi · 2019-03-05T06:29:30Z

Thanks!

sharwell · 2019-03-05T14:01:49Z

@CyrusNajmabadi Were you able to think of a standard benchmark scenario that would stress this section of code (something we can reference in the future when we want to make changes)?

CyrusNajmabadi · 2019-03-05T19:03:07Z

A regex containing something like this: now is the time for all good men to come to the aid of their party (going on as long as you want).

This is not actually unreasonable, as someone may be trying to match some large textual string with additional regexy things around it.

The above is the worst case on all axes. First, we create a large VirtualCharArray for the entire text. Then each character becomes a node in the tree, with an array allocation for each character (N + N allocations). Then, we make that into one contiguous text-node (so that the final tree is small). However, the process of doing that produced O(n²) garbage and would run in O(n²) time.

Post this change we will still produce N nodes during parsing, but condense down to 1-2 nodes at the end. Importantly, we never make the N intermediary arrays initially. And we don't make the n² arrays while condensing.

CyrusNajmabadi · 2019-03-05T19:05:58Z

Also, just for normal strings (i.e. not regex), this should now just be a lot better. If the string doesn't contain escapes (likely the common case), then we represent it in memory with just a single allocation. So you could have a humongous string (or tons of tiny strings) and we now allocate O(count of strings) vs O(size of strings). The former obviously being much smaller than the latter.

Where I have not improved things is on non-regex strings with escapes in them. These are effectively as expensive as before.

I have some ideas on how to improve that case (i.e. by storing a tree of 'chunks' internally), but it will dramatically increase the complexity. Simpler approaches (like a linear sequence of 'chunks') could improve memory, but might have terrible CPU in pathological cases.

CyrusNajmabadi added 4 commits March 2, 2019 12:56

Virtual char sequence abstraction

216b330

compiling

0924768

Fixes

f6e246d

tests

27f7fd5

CyrusNajmabadi requested a review from a team as a code owner March 2, 2019 22:06

CyrusNajmabadi added 14 commits March 2, 2019 14:08

cleanup

c331f91

optimize

788f5e5

Improve docs

e0eacd8

Simplify

eca2b35

Delete

856a92f

Simplify

f6b5b6d

Less allocations

acd1ed5

Simplify

ac824eb

Simplyf

be6df98

REstore

8c4efaa

REstore

aed2914

REstore

3adac8f

restore

a7e29fa

restore

41e7025

CyrusNajmabadi commented Mar 3, 2019

CyrusNajmabadi added 4 commits March 3, 2019 12:08

restore

4bdeaec

restore

85a9027

restore

5be36ea

restore

93e9edf

CyrusNajmabadi commented Mar 3, 2019

restore

fa03f71