Add limited support for backtracking Regex single char loops to simplified code gen #60385

stephentoub · 2021-10-14T04:09:08Z

In .NET 5, we added simpler compiled code gen for regexes that didn't entail backtracking (or that had only very constrained backtracking, such as a top-level alternation). In our corpus of ~90K regular expressions, that code generator is employed for ~40% of them. The primary purpose of adding that code generator initially was performance, as it was able to avoid a lot of the expense the original code generator incurred, especially for simple regexes. With the source generator, however, this code gen is even more valuable: the generated code is human-readable, really helps in understanding how the regex operates, is much more easily debugged, etc.

This change allows the simplified code gen to be used even if there are backtracking single-character loops in the regex, as long as those loops are in a top-level concatenation (or a simple grouping structure like a capture). This increases the percentage of expressions in our corpus that will use the simplified code gen to ~65%.

Once we have the simplified loop code gen, it's also a lot easier to add in vectorization of searching for the next location to back off to based on a literal that comes immediately after the loop (e.g. "abc.*def"). This adds support into both RegexOptions.Compiled and the source generator to use LastIndexOf in that case.
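To illustrate the idea, here is a minimal sketch (not the emitted code) of how a greedy loop followed by a literal can back off with a single LastIndexOf call instead of giving back one character at a time. The `GreedyLoopSketch` class and its `MatchEnd` method are hypothetical names used only for illustration. The sketch assumes a pattern of the shape `prefix.*literal` anchored at the start of the input, with a literal that contains no newline.

```csharp
using System;

public static class GreedyLoopSketch
{
    // Hypothetical helper: returns the end index of a match of  prefix.*literal
    // anchored at index 0 of input, or -1 if there is no match.
    // Assumption: literal contains no '\n', so any occurrence of it must lie
    // entirely inside the region the greedy loop consumed.
    public static int MatchEnd(string input, string prefix, string literal)
    {
        if (!input.StartsWith(prefix, StringComparison.Ordinal))
        {
            return -1;
        }

        // The greedy ".*" consumes everything up to the next newline (or the end).
        int loopStart = prefix.Length;
        int loopEnd = input.IndexOf('\n', loopStart);
        if (loopEnd < 0)
        {
            loopEnd = input.Length;
        }

        // Back off: jump straight to the last occurrence of the literal inside
        // the consumed region rather than retrying at every position.
        ReadOnlySpan<char> consumed = input.AsSpan(loopStart, loopEnd - loopStart);
        int index = consumed.LastIndexOf(literal.AsSpan());

        return index < 0 ? -1 : loopStart + index + literal.Length;
    }
}
```

For example, `GreedyLoopSketch.MatchEnd("abcXXdefYYdefZZ", "abc", "def")` lands on the last `def` and returns 13, while an input without `def` after `abc` returns -1.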

The change also entailed adding/updating a few recursive functions. The plan has been to adopt the same model as in System.Linq.Expressions, Roslyn, and elsewhere, where we fork processing to continue on a secondary thread, rather than trying to enforce some max depth or rewrite as iterative, so I've done that as part of this change as well.
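As a rough sketch of that model (an assumption-laden illustration, not the helper used in this PR), a recursive function can check for remaining stack space before descending and, when space is low, continue the same work on another thread that starts with a fresh stack. `ChainNode` and `DeepRecursionSketch` are hypothetical names. The use of `Task.Run` to obtain the secondary thread is my choice for the sketch; only `RuntimeHelpers.TryEnsureSufficientExecutionStack` is the real runtime API being demonstrated.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Hypothetical singly linked node, standing in for a deeply nested tree.
public sealed class ChainNode
{
    public ChainNode Child;
}

public static class DeepRecursionSketch
{
    // Recursively counts the nodes in the chain starting at node.
    public static int Depth(ChainNode node)
    {
        if (node == null)
        {
            return 0;
        }

        // If the current thread is running low on stack, fork: redo this call
        // on a thread-pool thread (which starts with an empty stack) and block
        // until it finishes. No max depth is enforced and the algorithm stays
        // recursive.
        if (!RuntimeHelpers.TryEnsureSufficientExecutionStack())
        {
            return Task.Run(() => Depth(node)).GetAwaiter().GetResult();
        }

        return 1 + Depth(node.Child);
    }
}
```

The trade-off in this design is that each fork blocks the calling thread while the continuation runs, which is acceptable when forks are rare and keeps the recursive code simple.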

As an example, the "email" benchmark from:
https://github.com/mariomka/regex-benchmark/blame/244ca6c0e4bc8dd257904c51c0b5cabba6956dd2/csharp/Benchmark.cs#L19-L20

private readonly static Regex s_email = new Regex(@"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled);

[Benchmark]
public int Email() => Count(s_email, s_mariomkaInput);

private static int Count(Regex r, string input)
{
    int count = 0;
    Match m = r.Match(input);
    while (m.Success)
    {
        count++;
        m = m.NextMatch();
    }
    return count;
}

Method	Toolchain	Mean	Ratio
Email	\main\corerun.exe	600.2 us	1.00
Email	\pr\corerun.exe	485.2 us	0.81

And here's the generated code for the matching logic before and after...

Before

protected override void Go()
{
    string runtext = base.runtext!;
    int runtextbeg = base.runtextbeg;
    int runtextend = base.runtextend;
    int runtextpos = base.runtextpos;
    int[] runtrack = base.runtrack!;
    int runtrackpos = base.runtrackpos;
    int[] runstack = base.runstack!;
    int runstackpos = base.runstackpos;
    int tmp1, tmp2, ch;
                    
    // 000000 *Lazybranch       addr = 29
    L0:
    runtrack[--runtrackpos] = runtextpos;
    runtrack[--runtrackpos] = 0;
                    
    // 000002 *Setmark
    L1:
    runstack[--runstackpos] = runtextpos;
    runtrack[--runtrackpos] = 1;
                    
    // 000003  Setrep           [+-.\\w], rep = 1
    L2:
    if (runtextend - runtextpos < 1)
    {
        goto Backtrack;
    }
    for (int i = 0; i < 1; i++)
    {
        if (!((ch = runtext[runtextpos + i]) < 128 ? ("\0\0栀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0004\n+,-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            goto Backtrack;
        }
    }
    runtextpos++;
                    
    // 000006  Setloopatomic    [+-.\\w], rep = inf
    L3:
    tmp1 = runtextend - runtextpos; // length
    tmp2 = tmp1 + 1;
    while (--tmp2 > 0)
    {
        if (!((ch = runtext[runtextpos++]) < 128 ? ("\0\0栀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0004\n+,-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            runtextpos--;
            break;
        }
    }
                    
    // 000009  UpdateBumpalong
    L4:
    runtrack[^1] = runtextpos;
                    
    // 000010  One              '@'
    L5:
    if (runtextpos >= runtextend || runtext[runtextpos++] != 64)
    {
        goto Backtrack;
    }
                    
    // 000012  Setrep           [-.\\w], rep = 1
    L6:
    if (runtextend - runtextpos < 1)
    {
        goto Backtrack;
    }
    for (int i = 0; i < 1; i++)
    {
        if (!((ch = runtext[runtextpos + i]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            goto Backtrack;
        }
    }
    runtextpos++;
                    
    // 000015 *Setloop          [-.\\w], rep = inf
    L7:
    tmp1 = runtextend - runtextpos; // length
    tmp2 = tmp1 + 1;
    while (--tmp2 > 0)
    {
        if (!((ch = runtext[runtextpos++]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            runtextpos--;
            break;
        }
    }
    if (tmp2 >= tmp1)
    {
        goto L8;
    }
    runtrack[--runtrackpos] = tmp1 - tmp2 - 1;
    runtrack[--runtrackpos] = runtextpos - 1;
    runtrack[--runtrackpos] = 2;
                    
    // 000018  One              '.'
    L8:
    if (runtextpos >= runtextend || runtext[runtextpos++] != 46)
    {
        goto Backtrack;
    }
                    
    // 000020  Setrep           [-.\\w], rep = 1
    L9:
    if (runtextend - runtextpos < 1)
    {
        goto Backtrack;
    }
    for (int i = 0; i < 1; i++)
    {
        if (!((ch = runtext[runtextpos + i]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            goto Backtrack;
        }
    }
    runtextpos++;
                    
    // 000023  Setloopatomic    [-.\\w], rep = inf
    L10:
    tmp1 = runtextend - runtextpos; // length
    tmp2 = tmp1 + 1;
    while (--tmp2 > 0)
    {
        if (!((ch = runtext[runtextpos++]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
        {
            runtextpos--;
            break;
        }
    }
                    
    // 000026 *Capturemark      index = 0
    L11:
    tmp1 = runstack[runstackpos++];
    base.Capture(0, tmp1, runtextpos);
    runtrack[--runtrackpos] = tmp1;
    runtrack[--runtrackpos] = 3;
                    
    // 000029  Stop
    L12:
    base.runtextpos = runtextpos;
    return;
                    
    Backtrack:
    int limit = base.runtrackcount * 4;
    if (runstackpos < limit)
    {
        base.runstackpos = runstackpos;
        base.DoubleStack(); // might change runstackpos and runstack
        runstackpos = base.runstackpos;
        runstack = base.runstack!;
    }
    if (runtrackpos < limit)
    {
        base.runtrackpos = runtrackpos;
        base.DoubleTrack(); // might change runtrackpos and runtrack
        runtrackpos = base.runtrackpos;
        runtrack = base.runtrack!;
    }
                    
    switch (runtrack[runtrackpos++])
    {
        case 0:
        {
            // 000000 *Lazybranch       addr = 29
            runtextpos = runtrack[runtrackpos++];
            goto L12;
        }
                        
        case 1:
        {
            // 000002 *Setmark
            runstackpos++;
            goto Backtrack;
        }
                        
        case 2:
        {
            // 000015 *Setloop          [-.\\w], rep = inf
            runtextpos = runtrack[runtrackpos++];
            tmp1 = runtrack[runtrackpos++]; // position
            if (tmp1 > 0)
            {
                runtrack[--runtrackpos] = tmp1 - 1;
                runtrack[--runtrackpos] = runtextpos - 1;
                runtrack[--runtrackpos] = 2;
            }
            goto L8;
        }
                        
        case 3:
        {
            // 000026 *Capturemark      index = 0
            runstack[--runstackpos] = runtrack[runtrackpos++];
            base.Uncapture();
            goto Backtrack;
        }
                        
        default:
        {
            global::System.Diagnostics.Debug.Fail($"Unexpected backtracking state {runtrack[runtrackpos - 1]}");
            break;
        }
    }
}

After

protected override void Go()
{
    string runtext = base.runtext!;
    int runtextpos = base.runtextpos;
    int runtextend = base.runtextend;
    int originalruntextpos = runtextpos;
    global::System.ReadOnlySpan<byte> byteSpan;
    char ch;
    global::System.ReadOnlySpan<char> textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                    
    // Concatenate
    //{
        // Setloopatomic [+-.\\w]+
        {
            int i0 = 0;
            while ((uint)i0 < (uint)textSpan.Length && ((ch = textSpan[i0]) < 128 ? ("\0\0栀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0004\n+,-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
            {
                i0++;
            }
            if (i0 < 1)
            {
                goto NoMatch;
            }
            textSpan = textSpan.Slice(i0);
            runtextpos += i0;
        }
                        
        // UpdateBumpalong
        {
            base.runtextpos = runtextpos;
        }
                        
        // One '@'
        {
            if ((uint)textSpan.Length < 1 || textSpan[0] != '@')
            {
                goto NoMatch;
            }
        }
                        
        // Setloop [-.\\w]+
        //{
            runtextpos++;
            textSpan = textSpan.Slice(1);
            int startingRunTextPos1 = runtextpos;
            int i4 = 0;
            while ((uint)i4 < (uint)textSpan.Length && ((ch = textSpan[i4]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
            {
                i4++;
            }
            if (i4 < 1)
            {
                goto NoMatch;
            }
            textSpan = textSpan.Slice(i4);
            runtextpos += i4;
            int endingRunTextPos2 = runtextpos;
            int crawlPos3 = base.Crawlpos();
            startingRunTextPos1 += 1;
            goto EndLoop1;
                            
            Backtrack0:
            if (startingRunTextPos1 >= endingRunTextPos2)
            {
                goto NoMatch;
            }
            endingRunTextPos2 = runtext.LastIndexOf('.', endingRunTextPos2 - 1, endingRunTextPos2 - startingRunTextPos1);
            if (endingRunTextPos2 < 0)
            {
                goto NoMatch;
            }
            runtextpos = endingRunTextPos2;
            textSpan = global::System.MemoryExtensions.AsSpan(runtext, runtextpos, runtextend - runtextpos);
                            
            EndLoop1:
        //}
                        
        // One '.'
        {
            if ((uint)textSpan.Length < 1 || textSpan[0] != '.')
            {
                goto Backtrack0;
            }
        }
                        
        // Setloopatomic [-.\\w]+
        {
            runtextpos++;
            textSpan = textSpan.Slice(1);
            int i5 = 0;
            while ((uint)i5 < (uint)textSpan.Length && ((ch = textSpan[i5]) < 128 ? ("\0\0怀Ͽ\ufffe蟿\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : CharInClass((char)ch, "\0\u0002\n-/\0\u0002\u0004\u0005\u0003\u0001\u0006\t\u0013\0")))
            {
                i5++;
            }
            if (i5 < 1)
            {
                goto Backtrack0;
            }
            textSpan = textSpan.Slice(i5);
            runtextpos += i5;
        }
                        
    //}
                    
    // Match
    base.runtextpos = runtextpos;
    base.Capture(0, originalruntextpos, runtextpos);
    return;
                    
    // No match
    NoMatch:
    return;
}

ghost · 2021-10-14T04:09:14Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

danmoseley · 2021-10-14T04:24:43Z

Just curious, what does the generated code look like for this example? could you share?

stephentoub · 2021-10-14T04:31:04Z

> Just curious, what does the generated code look like for this example? could you share?

It's already in the PR description. Expand the Before/After nodes.

danmoseley · 2021-10-14T04:41:51Z

It's surprisingly readable!

stephentoub · 2021-10-14T12:36:14Z

> It's surprisingly readable!

😯 Oh ye of little faith 😄

jeffhandley · 2021-10-15T03:57:43Z

> 😯 Oh ye of little faith 😄

There's a joke here somewhere about the readability of goto statements (that is still respectful), but I don't know what it is. 😼

stephentoub · 2021-10-17T03:46:57Z

@BrzVlad, any idea why the "Libraries Test Run release mono_interpreter Linux x64 Debug" leg is failing here? It looks like it's hitting a seg fault.

BrzVlad · 2021-10-17T09:54:32Z

@stephentoub Should get fixed by #60514

stephentoub · 2021-10-17T11:04:45Z

Thanks

eerhardt

Just a few questions - mostly for my learning.

stephentoub · 2021-10-21T19:47:02Z

@safern, some drawing tests have repeatedly failed on this PR. It's not clear to me how my changes here could have broken this, but it's failed multiple times.

    System.Drawing.Tests.PenTests.Ctor_Brush_Width<SolidBrush>(brush: SolidBrush { Color = Color [Red] }, width: 0, expectedPenType: SolidColor) [FAIL]
      Assert.Equal() Failure
      Expected: 0
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(62,0): at System.Drawing.Tests.PenTests.Ctor_Brush_Width[T](T brush, Single width, PenType expectedPenType)
    System.Drawing.Tests.PenTests.Ctor_Brush_Width<SolidBrush>(brush: SolidBrush { Color = Color [Red] }, width: -1, expectedPenType: SolidColor) [FAIL]
      Assert.Equal() Failure
      Expected: -1
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(62,0): at System.Drawing.Tests.PenTests.Ctor_Brush_Width[T](T brush, Single width, PenType expectedPenType)
    System.Drawing.Tests.PenTests.Ctor_Brush_Width<SolidBrush>(brush: SolidBrush { Color = Color [Red] }, width: -∞, expectedPenType: SolidColor) [FAIL]
      Assert.Equal() Failure
      Expected: -∞
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(62,0): at System.Drawing.Tests.PenTests.Ctor_Brush_Width[T](T brush, Single width, PenType expectedPenType)
    System.Drawing.Tests.PenTests.Ctor_Color_Width(width: -1) [FAIL]
      Assert.Equal() Failure
      Expected: -1
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(116,0): at System.Drawing.Tests.PenTests.Ctor_Color_Width(Single width)
    System.Drawing.Tests.PenTests.Ctor_Color_Width(width: 0) [FAIL]
      Assert.Equal() Failure
      Expected: 0
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(116,0): at System.Drawing.Tests.PenTests.Ctor_Color_Width(Single width)
    System.Drawing.Tests.PenTests.Ctor_Color_Width(width: -∞) [FAIL]
      Assert.Equal() Failure
      Expected: -∞
      Actual:   1
      Stack Trace:
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(1384,0): at System.Drawing.Tests.PenTests.VerifyPen[T](Pen pen, PenType expectedPenType, Single expectedWidth)
        /_/src/libraries/System.Drawing.Common/tests/PenTests.cs(116,0): at System.Drawing.Tests.PenTests.Ctor_Color_Width(Single width)

Is there any known issue here? I don't see any open issues for it.

stephentoub · 2021-10-21T19:52:02Z

Ah, I see #60731 was literally just created.

safern · 2021-10-21T20:06:26Z

Hmm, interesting. I'll dig into what caused them to start failing and disable them. Thanks for the ping.

kunalspathak · 2021-10-26T15:39:44Z

Linux/x64 improvement - dotnet/perf-autofiling-issues#1975

stephentoub · 2021-10-26T15:41:45Z

> Linux/x64 improvement

Excellent.

stephentoub added the area-System.Text.RegularExpressions label Oct 14, 2021

stephentoub added this to the 7.0.0 milestone Oct 14, 2021

stephentoub force-pushed the simplebacktrackingloops branch from 253d091 to 14eed58 on October 14, 2021 17:27

This was referenced Oct 14, 2021

System.IO.Tests.File_ReadWriteAllBytes.ReadAllBytes_NonSeekableFileStream_InWindows failed #60427

Open

system.io.tests.file_readwriteallbytes.readallbytes_nonseekablefilestream_inwindows #60444

Closed

stephentoub requested a review from eerhardt October 18, 2021 13:56

am11 reviewed Oct 19, 2021

stephentoub force-pushed the simplebacktrackingloops branch 3 times, most recently from 85a8031 to 7dfe534, on October 19, 2021 14:30

runfoapp bot mentioned this pull request Oct 19, 2021

1ES Pools often run out of space #60610

Closed

eerhardt approved these changes Oct 20, 2021

stephentoub force-pushed the simplebacktrackingloops branch 2 times, most recently from dd72b32 to 124ddcf, on October 21, 2021 18:05

stephentoub added 3 commits October 21, 2021 17:06

Address PR feedback

3d4ac7d

Clean up partial classes in SourceGenRegexAsync test helper

e8bb072

stephentoub force-pushed the simplebacktrackingloops branch from 124ddcf to e8bb072 on October 21, 2021 21:14

stephentoub merged commit 8c8157f into dotnet:main Oct 22, 2021

stephentoub deleted the simplebacktrackingloops branch October 22, 2021 04:34

runfoapp bot mentioned this pull request Oct 22, 2021

system.drawing.tests.pentests.ctor_color_width failing in CI #60731

Open

stephentoub mentioned this pull request Nov 16, 2021

Add single char lazy loop support to simplified Regex code gen #61698

Merged

ghost locked as resolved and limited conversation to collaborators Nov 25, 2021
