Fix NonBacktracking regex empty capture groups with newline at end of input (#120622)
- [x] Investigate and understand the issue with empty capture groups
when NonBacktracking is used with newline at end
- [x] Identify root cause in `DefaultInputReader.GetPositionId` and
`FindSubcaptures` method
- [x] Create focused test cases that reproduce the issue for all anchor
combinations
- [x] Fix the bug by conditioning special \n handling on
`_containsEndZAnchor` for both match finding and capture tracking
- [x] Build and test the changes
- [x] Run full test suite to verify no regressions (29,315 functional
tests + 1,005 unit tests passed)
- [x] Address follow-up concerns about \z anchor behavior
## Summary
Fixed a regression in NonBacktracking regex where capture groups would
be empty when the input ends with a newline character.
**Root Cause**: The `DefaultInputReader.GetPositionId` method was
unconditionally applying special handling for '\n' at the end of input
(returning `_minterms.Length` instead of the normal minterm ID). This
special handling is specifically designed for the `\Z` anchor to support
lookahead semantics, but was being applied to ALL anchored patterns,
causing incorrect derivative transitions and empty capture groups.
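To make the anchor semantics behind that special case concrete: in .NET, `\Z` (and `$` without `RegexOptions.Multiline`) matches at the very end of the input or just before a final newline, while `\z` matches only at the very end. The following is a minimal sketch using the default engine; the `AnchorDemo` class and its `Matches` helper are illustrative names, not part of this change.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True when the pattern matches somewhere in the input (default options).
    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        const string input = "A\n";
        Console.WriteLine(Matches(@"A\Z", input)); // True: \Z allows one trailing newline
        Console.WriteLine(Matches(@"A\z", input)); // False: \z requires the true end of input
        Console.WriteLine(Matches(@"A$", input));  // True: $ behaves like \Z without Multiline
    }
}
```

Because `\Z` has to "see" that the remaining input is a single newline, the engine needs a distinct signal for a final `\n`; patterns without `\Z` have no such need.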
**Solution**:
1. In `DefaultInputReader.GetPositionId`: apply the special '\n'
handling only when `_containsEndZAnchor` is true (the pattern contains
`\Z`, or `$` without `RegexOptions.Multiline`)
2. In `FindSubcaptures`: use `DefaultInputReader` when
`_containsEndZAnchor` is true (needs special handling), otherwise use
`NoZAnchorOptimizedInputReader` (normal handling)
This ensures:
- Patterns with `\Z` or `$` (non-multiline) use special '\n' handling
for correct lookahead semantics
- Patterns with `\z`, `^`, or no end anchor use normal '\n' handling for
correct capture tracking
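The conditional described above can be sketched as a toy model. This is not the code in `SymbolicRegexMatcher.cs`: `ToyInputReader`, its parameters, and the two-minterm lookup are hypothetical stand-ins, and the `-1` return for out-of-range positions is an assumption. Only the idea of returning the minterm count as a sentinel for a final '\n', gated on a `\Z` flag, comes from the description.

```csharp
using System;

// Toy model only: names and signatures are hypothetical, not the runtime's internals.
public static class ToyInputReader
{
    // Returns a position ID for input[pos]:
    //   -1            when pos is outside the input (assumed convention)
    //   mintermCount  sentinel for a '\n' that is the last character,
    //                 produced only when the pattern contains \Z
    //   otherwise     the ordinary minterm ID from the lookup
    public static int GetPositionId(
        ReadOnlySpan<char> input, int pos, int mintermCount,
        Func<char, int> mintermLookup, bool containsEndZAnchor)
    {
        if ((uint)pos >= (uint)input.Length)
            return -1;

        char c = input[pos];
        if (containsEndZAnchor && c == '\n' && pos == input.Length - 1)
            return mintermCount;

        return mintermLookup(c);
    }

    public static void Main()
    {
        // Hypothetical two-minterm alphabet: 0 = whitespace, 1 = everything else.
        Func<char, int> lookup = ch => char.IsWhiteSpace(ch) ? 0 : 1;

        // With \Z in the pattern, the final '\n' maps to the sentinel.
        Console.WriteLine(GetPositionId("A\n", 1, 2, lookup, containsEndZAnchor: true));  // 2
        // Without \Z, the final '\n' is an ordinary whitespace character.
        Console.WriteLine(GetPositionId("A\n", 1, 2, lookup, containsEndZAnchor: false)); // 0
    }
}
```

In this model, a pattern such as `^(A)(\s)` only ever sees the ordinary whitespace ID for the trailing newline, so `(\s)` can consume and capture it.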
**Testing**:
- Added regression test cases for all anchor combinations: `^(A)(\s)`,
`^(A)(\s)\z`, `^(A)(\s)$`, `^(A)(\s)\Z`, and `(A)(\s)\z`
- All 29,315 functional tests pass
- All 1,005 unit tests pass
- Verified existing `\Z` anchor behavior remains correct
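One way to spot-check the listed patterns locally is to compare the NonBacktracking engine against the default backtracking engine on the input from the original report. This is a hedged sketch, not the test code added by this change; `CaptureCheck` and `Capture` are illustrative names, and `RegexOptions.NonBacktracking` requires .NET 7 or later.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureCheck
{
    // Patterns taken from the regression cases listed above.
    public static readonly string[] Patterns =
    {
        @"^(A)(\s)", @"^(A)(\s)\z", @"^(A)(\s)$", @"^(A)(\s)\Z", @"(A)(\s)\z",
    };

    // Returns the two captured groups, or null when the pattern does not match.
    public static (string First, string Second)? Capture(
        string pattern, string input, RegexOptions options)
    {
        Match m = Regex.Match(input, pattern, options);
        if (!m.Success) return null;
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    public static void Main()
    {
        foreach (string pattern in Patterns)
        {
            // The backtracking engine serves as the reference result.
            var expected = Capture(pattern, "A\n", RegexOptions.None);
            var actual = Capture(pattern, "A\n", RegexOptions.NonBacktracking);
            bool same = object.Equals(expected, actual);
            Console.WriteLine($"{pattern}: {(same ? "same" : "DIFFERENT")}");
        }
    }
}
```

On a runtime that includes this fix, every pattern would be expected to report `same`; on an affected build, a `DIFFERENT` line points at the empty-capture behavior described in the issue.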
<!-- START COPILOT CODING AGENT SUFFIX -->
<details>
<summary>Original prompt</summary>
>
> ----
>
> *This section details the original issue you should resolve*
>
> <issue_title>Newline causes empty regex capture groups when
NonBacktracking in dotnet 9 and above</issue_title>
> <issue_description>### Description
>
> Found while upgrading 8 -> 10, but appears in 9 as well. I'm fairly
sure I can't find an explanation of it in the documentation.
>
> Both Nonbacktracking and having the final character be a \n rather
than any other kind of whitespace appear to be critical to triggering
the bug. Nonbacktracking is in there because this case was cut down from
a much larger regular expression while trying to diagnose this. It is
important to the larger expression for performance.
>
> ### Reproduction Steps
>
>     using System.Text.RegularExpressions;
>
>     string line = "A\n";
>     string expression = "^(A)(\\s)";
>     Match match = Regex.Match(line, expression,
>         RegexOptions.IgnoreCase | RegexOptions.NonBacktracking);
>     if (!match.Success)
>     {
>         System.Environment.Exit(1);
>     }
>
>     Console.WriteLine($"reg: '{match.Groups[1]}' '{match.Groups[2]}'");
>
> ### Expected behavior
>
> `reg: 'A' '
> '`
>
> The two characters of input are each matched and end up captured in a
group.
>
> ### Actual behavior
>
> `reg: '' ''`
>
> Both capture groups end up empty, even though the regex has matched.
>
> ### Regression?
>
> Works as expected in dotnet 8.
>
> ### Known Workarounds
>
> Applying Trim() to the string rather than relying on regex matching of
the start and end, or applying RegexOptions.Multiline
>
> ### Configuration
>
> Dotnet 10.0.0-rc.1.25451.107 on Windows x64 in VS Insiders.
>
> As above - this code has been in place for many years and works on
dotnet 8 and many previous versions.
>
> ### Other information
>
> _No response_</issue_description>
>
> <agent_instructions>This will need investigation to determine the root
cause. I suspect it was introduced in
https://github.com/dotnet/runtime/pull/102655</agent_instructions>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> </comments>
>
</details>
Fixes #120202
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
5 additions & 3 deletions

    @@ -474,7 +474,9 @@ public SymbolicMatch FindMatch(RegexRunnerMode mode, ReadOnlySpan<char> input, i
                 }
                 else
                 {
    -                Registers endRegisters = _containsAnyAnchor ?
    +                // Use DefaultInputReader when pattern contains \Z anchor (needs special \n handling),
    +                // otherwise use NoZAnchorOptimizedInputReader for correct capture tracking