Commit aa9c20b

Copilot and stephentoub authored
Fix NonBacktracking regex empty capture groups with newline at end of input (#120622)
- [x] Investigate and understand the issue with empty capture groups when NonBacktracking is used with newline at end
- [x] Identify root cause in `DefaultInputReader.GetPositionId` and `FindSubcaptures` method
- [x] Create focused test cases that reproduce the issue for all anchor combinations
- [x] Fix the bug by conditioning special \n handling on `_containsEndZAnchor` for both match finding and capture tracking
- [x] Build and test the changes
- [x] Run full test suite to verify no regressions (29,315 functional tests + 1,005 unit tests passed)
- [x] Address follow-up concerns about \z anchor behavior

## Summary

Fixed a regression in NonBacktracking regex where capture groups would be empty when the input ends with a newline character.

**Root Cause**: The `DefaultInputReader.GetPositionId` method was unconditionally applying special handling for '\n' at the end of input (returning `_minterms.Length` instead of the normal minterm ID). This special handling is specifically designed for the `\Z` anchor to support lookahead semantics, but was being applied to ALL anchored patterns, causing incorrect derivative transitions and empty capture groups.

**Solution**:
1. In `DefaultInputReader.GetPositionId`: condition the special '\n' handling on `_containsEndZAnchor` (only for `\Z` anchor)
2. In `FindSubcaptures`: use `DefaultInputReader` when `_containsEndZAnchor` is true (needs special handling), otherwise use `NoZAnchorOptimizedInputReader` (normal handling)

This ensures:
- Patterns with `\Z` or `$` (non-multiline) use special '\n' handling for correct lookahead semantics
- Patterns with `\z`, `^`, or no end anchor use normal '\n' handling for correct capture tracking

**Testing**:
- Added regression test cases for all anchor combinations: `^(A)(\s)`, `^(A)(\s)\z`, `^(A)(\s)$`, `^(A)(\s)\Z`, and `(A)(\s)\z`
- All 29,315 functional tests pass
- All 1,005 unit tests pass
- Verified existing `\Z` anchor behavior remains correct

<!-- START COPILOT CODING AGENT SUFFIX -->

<details>

<summary>Original prompt</summary>

> ----
>
> *This section details on the original issue you should resolve*
>
> <issue_title>Newline causes empty regex capture groups when NonBacktracking in dotnet 9 and above</issue_title>
> <issue_description>
>
> ### Description
>
> Found while upgrading 8 -> 10, but appears in 9 as well. I'm fairly sure I can't find an explanation of it in the documentation.
>
> Both Nonbacktracking and having the final character be a \n rather than any other kind of whitespace appear to be critical to triggering the bug. Nonbacktracking is in there because this case was cut down from a much larger regular expression while trying to diagnose this. It is important to the larger expression for performance.
>
> ### Reproduction Steps
>
> `using System.Text.RegularExpressions;
>
> string line = "A\n";
> string expression = "^(A)(\\s)";
> Match match = Regex.Match(line, expression, RegexOptions.IgnoreCase | RegexOptions.NonBacktracking);
> if (!match.Success)
> {
> System.Environment.Exit(1);
> }
>
> Console.WriteLine($"reg: '{match.Groups[1]}' '{match.Groups[2]}'");`
>
> ### Expected behavior
>
> `reg: 'A' '
> '`
>
> The two characters of input are each matched and end up captured in a group.
>
> ### Actual behavior
>
> 'reg: '' ''
>
> Both capture groups end up empty, even though the regex has matched.
>
> ### Regression?
>
> Works as expected in dotnet 8.
>
> ### Known Workarounds
>
> Applying Trim() to the string rather than relying on regex matching of the start and end, or applying RegexOptions.Multiline
>
> ### Configuration
>
> Dotnet 10.0.0-rc.1.25451.107 on Windows x64 in VS Insiders.
>
> As above - this code has been in place for many years and works on dotnet 8 and many previous versions.
>
> ### Other information
>
> _No response_
>
> </issue_description>
>
> <agent_instructions>This will need investigation to determine the root cause. I suspect it was introduced in https://github.com/dotnet/runtime/pull/102655</agent_instructions>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> </comments>

</details>

Fixes #120202

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
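The distinction between `\Z` and `\z` that the fix relies on can be illustrated with a minimal sketch. It uses the default (backtracking) engine and only the public `Regex.IsMatch` API; the variable names are illustrative and do not come from the commit.

```csharp
using System;
using System.Text.RegularExpressions;

// Input ends with a single '\n', the case this commit is about.
string input = "A\n";

// \Z (and non-multiline $) match at the very end OR just before a final '\n'.
bool bigZ = Regex.IsMatch(input, @"A\Z");
bool dollar = Regex.IsMatch(input, @"A$");

// \z matches only at the very end, so the trailing '\n' prevents a match here.
bool smallZ = Regex.IsMatch(input, @"A\z");

Console.WriteLine($"\\Z: {bigZ}, $: {dollar}, \\z: {smallZ}");
// prints: \Z: True, $: True, \z: False
```

Because only `\Z` and non-multiline `$` need to look past a final newline, it is plausible that only those patterns need the special end-of-input minterm described above.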
1 parent b9969dd commit aa9c20b

2 files changed

+16 -3 lines changed

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs

Lines changed: 5 additions & 3 deletions
@@ -474,7 +474,9 @@ public SymbolicMatch FindMatch(RegexRunnerMode mode, ReadOnlySpan<char> input, i
     }
     else
     {
-        Registers endRegisters = _containsAnyAnchor ?
+        // Use DefaultInputReader when pattern contains \Z anchor (needs special \n handling),
+        // otherwise use NoZAnchorOptimizedInputReader for correct capture tracking
+        Registers endRegisters = _containsEndZAnchor ?
             FindSubcaptures<DefaultInputReader>(input, matchStart, matchEnd, perThreadData) :
             FindSubcaptures<NoZAnchorOptimizedInputReader>(input, matchStart, matchEnd, perThreadData);
         return new SymbolicMatch(matchStart, matchEnd - matchStart, endRegisters.CaptureStarts, endRegisters.CaptureEnds);
@@ -1481,9 +1483,9 @@ public static int GetPositionId(SymbolicRegexMatcher<TSet> matcher, ReadOnlySpan
 {
     if ((uint)pos < (uint)input.Length)
     {
-        // Find the minterm, handling the special case for the last \n for states that start with a relevant anchor
+        // Find the minterm, handling the special case for the last \n for \Z anchor
         int c = input[pos];
-        return c == '\n' && pos == input.Length - 1 ?
+        return c == '\n' && pos == input.Length - 1 && matcher._containsEndZAnchor ?
             matcher._minterms.Length : // mintermId = minterms.Length represents an \n at the very end of input
             matcher._mintermClassifier.GetMintermID(c);
     }
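The patched condition in `GetPositionId` can be modelled outside the runtime with a small standalone sketch. Everything below is hypothetical scaffolding: `MintermCount`, `GetMintermId`, the three invented minterm IDs, and the `-1` returned for out-of-range positions are assumptions made for illustration, not the matcher's real data or behaviour. Only the shape of the conditional mirrors the diff.

```csharp
using System;

// Hypothetical stand-in for matcher._minterms.Length (the real value depends on the pattern).
const int MintermCount = 3;

// Hypothetical classifier: the real one is derived from the pattern's character sets.
int GetMintermId(char c) => c == '\n' ? 0 : char.IsWhiteSpace(c) ? 1 : 2;

// Simplified model of the patched check: the reserved "final newline" ID is
// returned only when the pattern contains a \Z-style anchor.
int GetPositionId(ReadOnlySpan<char> input, int pos, bool containsEndZAnchor)
{
    if ((uint)pos >= (uint)input.Length)
    {
        return -1; // assumption of this sketch: out-of-range positions map to -1
    }

    char c = input[pos];
    return c == '\n' && pos == input.Length - 1 && containsEndZAnchor
        ? MintermCount        // reserved ID for '\n' at the very end of input
        : GetMintermId(c);    // ordinary minterm lookup
}

Console.WriteLine(GetPositionId("A\n", 1, containsEndZAnchor: false));  // 0: ordinary '\n'
Console.WriteLine(GetPositionId("A\n", 1, containsEndZAnchor: true));   // 3: reserved ID
Console.WriteLine(GetPositionId("A\n\n", 1, containsEndZAnchor: true)); // 0: not the last char
```

Under this model, a pattern without `\Z` sees the final `\n` as an ordinary character, which is the behaviour the commit message credits with restoring correct capture tracking.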

src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Groups.Tests.cs

Lines changed: 11 additions & 0 deletions
@@ -461,6 +461,17 @@ public static IEnumerable<object[]> Groups_MemberData()
     yield return (enUS, @"^(cat)\s+(dog)", "cat \n\n\n dog", RegexOptions.None, new string[] { "cat \n\n\n dog", "cat", "dog" });
     yield return (enUS, @"^(cat)\s+(dog)", "cat \n\n\n dog", RegexOptions.Multiline, new string[] { "cat \n\n\n dog", "cat", "dog" });
     yield return (enUS, @"(mouse)\s\n^(cat)\s+(dog)", "mouse\n\ncat \n\n\n dog", RegexOptions.Multiline, new string[] { "mouse\n\ncat \n\n\n dog", "mouse", "cat", "dog" });
+
+    // Regression test for https://github.com/dotnet/runtime/issues/120202
+    // Capture groups with beginning anchor and newline at end should work correctly with different end anchors
+    yield return (enUS, @"^(A)(\s)", "A\n", RegexOptions.None, new string[] { "A\n", "A", "\n" });
+    yield return (enUS, @"^(A)(\s)\z", "A\n", RegexOptions.None, new string[] { "A\n", "A", "\n" });
+    yield return (enUS, @"^(A)(\s)$", "A\n", RegexOptions.None, new string[] { "A\n", "A", "\n" });
+    yield return (enUS, @"^(A)(\s)\Z", "A\n", RegexOptions.None, new string[] { "A\n", "A", "\n" });
+    yield return (enUS, @"(A)(\s)\z", "A\n", RegexOptions.None, new string[] { "A\n", "A", "\n" }); // without beginning anchor
+    yield return (enUS, @"^(A)(\s)", "A\n", RegexOptions.Multiline, new string[] { "A\n", "A", "\n" });
+    yield return (enUS, @"^(A)(\s)", "A ", RegexOptions.None, new string[] { "A ", "A", " " });
+
     if (!RegexHelpers.IsNonBacktracking(engine)) // ECMAScript not supported
     {
         yield return (enUS, @"^cat\s+dog", "cat \n\n\n dog", RegexOptions.ECMAScript, new string[] { "cat \n\n\n dog" });
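Outside the test harness, the same regression cases can be checked with a short standalone sketch that compares the default engine against `RegexOptions.NonBacktracking` (available in .NET 7 and later). The `Groups` helper is a hypothetical name introduced here. Whether the two engines agree depends on the runtime: the issue reports empty groups on .NET 9 and .NET 10 RC1, so no output is asserted for the NonBacktracking column.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Patterns taken from the regression cases added in this commit.
string[] patterns = { @"^(A)(\s)", @"^(A)(\s)\z", @"^(A)(\s)$", @"^(A)(\s)\Z", @"(A)(\s)\z" };

// Hypothetical helper: returns the values of all groups (group 0 first) for input "A\n".
string[] Groups(string pattern, RegexOptions options) =>
    Regex.Match("A\n", pattern, options).Groups.Cast<Group>().Select(g => g.Value).ToArray();

foreach (string pattern in patterns)
{
    string[] expected = Groups(pattern, RegexOptions.None);           // backtracking engine
    string[] actual = Groups(pattern, RegexOptions.NonBacktracking);  // engine under test
    bool same = expected.SequenceEqual(actual);
    Console.WriteLine($"{pattern,-14} engines agree: {same}");
}
```

On a runtime that includes this fix, every line should report agreement, matching the expected arrays in the test data above; on an affected runtime, the patterns without `\Z` or `$` are the ones the commit message identifies as returning empty groups.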
