-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Fix NonBacktracking regex empty capture groups with newline at end of input #120622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
The issue was in DefaultInputReader.GetPositionId which was applying special handling for '\n' at the end of input unconditionally. This special handling is only needed for the \Z anchor. Added a check for _containsEndZAnchor to only apply the special handling when needed. Also added regression tests to verify the fix. Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a regression in .NET 9 where NonBacktracking regex capture groups were incorrectly returned as empty strings when the input ended with a newline character, even though the overall match succeeded.
- Modified the condition in
DefaultInputReader.GetPositionId
to only apply special\n
handling when the pattern actually contains the\Z
anchor - Added regression test cases to verify the fix works correctly
- Ensures existing
\Z
anchor behavior remains unchanged
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs |
Fixed the condition to only apply special newline handling when \Z anchor is present |
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Groups.Tests.cs |
Added regression test cases for the fixed capture group behavior |
cc: @ieviev |
i found another inconsistency:
this might also be related to the handling of \Z |
@copilot, can you investigate that as well? |
I've already investigated and fixed the issue described in #120202 (empty capture groups with NonBacktracking when input ends with '\n'). The fix conditions the special '\n' handling on Could you clarify what specific aspect you'd like me to investigate further? |
The issue was that FindSubcaptures was using DefaultInputReader which applies special \n handling for \Z anchor. This interfered with capture tracking for patterns with beginning anchors but without \Z. The fix is to make FindSubcaptures always use NoZAnchorOptimizedInputReader, which doesn't apply the special \n handling. This allows: - Match finding to use the special \n handling for \Z anchor (correct) - Capture tracking to use the actual character without special handling (correct) Added regression tests for all anchor combinations. Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
I've investigated and fixed the inconsistency. The root cause was that The fix (commit 4fe286e) makes All anchor combinations now work correctly:
Added regression tests for all these patterns. |
@ieviev, does that sound right? |
@copilot |
if |
@copilot, they wrote:
|
Based on feedback from @ieviev, the special \n handling should only apply for \Z anchor (uppercase), not for \z or other anchors. This ensures: - Patterns with \Z use special handling (for lookahead semantics) - Patterns with \z, ^, or no end anchor use normal handling (correct captures) Added test case for (A)(\s)\z without beginning anchor as requested. All 29,315 tests pass. Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Added test case for
All 29,315 tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
DefaultInputReader.GetPositionId
andFindSubcaptures
method_containsEndZAnchor
for both match finding and capture trackingSummary
Fixed a regression in NonBacktracking regex where capture groups would be empty when the input ends with a newline character.
Root Cause: The
DefaultInputReader.GetPositionId
method was unconditionally applying special handling for '\n' at the end of input (returning_minterms.Length
instead of the normal minterm ID). This special handling is specifically designed for the\Z
anchor to support lookahead semantics, but was being applied to ALL anchored patterns, causing incorrect derivative transitions and empty capture groups.Solution:
DefaultInputReader.GetPositionId
: condition the special '\n' handling on_containsEndZAnchor
(only for\Z
anchor)FindSubcaptures
: useDefaultInputReader
when_containsEndZAnchor
is true (needs special handling), otherwise useNoZAnchorOptimizedInputReader
(normal handling)This ensures:
\Z
or$
(non-multiline) use special '\n' handling for correct lookahead semantics\z
,^
, or no end anchor use normal '\n' handling for correct capture trackingTesting:
^(A)(\s)
,^(A)(\s)\z
,^(A)(\s)$
,^(A)(\s)\Z
, and(A)(\s)\z
\Z
anchor behavior remains correctOriginal prompt
Fixes #120202
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.