Update Regex named blocks from Unicode 4.0 to Unicode 16.0 #120623
base: main
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Added 52 new Unicode blocks and updated existing ones to match the Unicode 16.0 specification. Total count increased from 108 to 160 named blocks.
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>

- Add IsGreek as alias for IsGreekandCoptic for backward compatibility
- Add comprehensive tests for 52 new Unicode blocks
- Remove tests for deprecated surrogate and private use blocks
- All 29,287 tests now passing

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>

Document how to use the tool for future Unicode updates.
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Pull Request Overview
This PR updates the Unicode named blocks in `System.Text.RegularExpressions` from Unicode 4.0 to Unicode 16.0, adding 52 new Unicode blocks while maintaining backward compatibility. The update enables regex patterns to match characters from modern scripts and symbol sets that were not previously supported.
Key changes include:
- Added 52 new Unicode blocks covering various scripts like Arabic Extended, Balinese, Cherokee Supplement, and many others
- Created a new tool, `GenRegexNamedBlocks`, to automate future Unicode updates
- Updated test coverage to include all new blocks
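As a rough illustration of what the named blocks look like from the caller's side, the sketch below probes a block name with `\p{...}`. The helper `ContainsBlockChar` is hypothetical and not part of this PR. `IsGreek` is a long-standing block name, while `IsBalinese` is assumed to be recognized only on a runtime that includes this change, so the last line prints a runtime-dependent result.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns true when the runtime recognizes the block name
// and the input contains at least one character from that block.
static bool ContainsBlockChar(string input, string blockName)
{
    try
    {
        return Regex.IsMatch(input, @"\p{" + blockName + "}");
    }
    catch (ArgumentException)
    {
        // An unknown block name is rejected when the pattern is parsed.
        return false;
    }
}

Console.WriteLine(ContainsBlockChar("αβγ", "IsGreek"));       // True
Console.WriteLine(ContainsBlockChar("abc", "IsGreek"));       // False
// U+1B05 lies in the Balinese block (U+1B00..U+1B7F); the result depends on
// whether the runtime knows the IsBalinese name.
Console.WriteLine(ContainsBlockChar("\u1B05", "IsBalinese"));
```

Catching `ArgumentException` keeps the probe usable on older runtimes, where an unrecognized block name fails at pattern construction rather than at match time.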
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
src/libraries/System.Text.RegularExpressions/tools/GenRegexNamedBlocks/README.md | Documentation for the new Unicode block generation tool |
src/libraries/System.Text.RegularExpressions/tools/GenRegexNamedBlocks/Program.cs | Tool implementation for parsing Unicode Blocks.txt and generating regex named block entries |
src/libraries/System.Text.RegularExpressions/tools/GenRegexNamedBlocks/GenRegexNamedBlocks.csproj | Project file for the Unicode block generation tool |
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexCharacterSetTests.cs | Added test coverage for all 52 new Unicode blocks and removed tests for deprecated surrogate blocks |
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs | Updated named blocks array with 52 new Unicode blocks, removed deprecated surrogate and private use blocks |
    {
        // We only care about lines of the form "XXXX..XXXX; Block name"
        var match = inputLineRegex.Match(inputLine);
        if (match == null || !match.Success)
Copilot AI, Oct 12, 2025
The null check for `match` is unnecessary and incorrect. The `Match` method never returns null; it returns a `Match` object with `Success = false` when no match is found. The condition should be `if (!match.Success)`.
Suggested change:

    - if (match == null || !match.Success)
    + if (!match.Success)
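The behavior described in this comment can be checked with a small sketch. The pattern and the helper `ParseBlockLine` below are assumptions made for illustration; they are not the tool's actual code, only one plausible way to read lines of the form "XXXX..XXXX; Block name".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern for lines such as "0370..03FF; Greek and Coptic".
var lineRegex = new Regex(@"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+)$");

// Hypothetical helper: returns the parsed block, or null for comment and blank lines.
static (int Start, int End, string Name)? ParseBlockLine(Regex regex, string line)
{
    Match match = regex.Match(line);

    // Regex.Match returns a non-null Match even on failure, so Success is the only check needed.
    if (!match.Success)
    {
        return null;
    }

    int start = Convert.ToInt32(match.Groups[1].Value, 16);
    int end = Convert.ToInt32(match.Groups[2].Value, 16);
    return (start, end, match.Groups[3].Value.Trim());
}

Match miss = lineRegex.Match("# a comment line");
Console.WriteLine(miss is null);   // False
Console.WriteLine(miss.Success);   // False

var block = ParseBlockLine(lineRegex, "0370..03FF; Greek and Coptic");
Console.WriteLine(block.HasValue ? block.Value.Name : "(skipped)");  // Greek and Coptic
```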
    }

    // Sort by start code for consistent output
    entries.Sort((a, b) => string.Compare(a.name, b.name, StringComparison.Ordinal));
Copilot AI, Oct 12, 2025
The comment indicates sorting by start code, but the code is actually sorting by name. Either update the comment to 'Sort by name for consistent output' or change the sort logic to sort by start code.
Suggested change:

    - entries.Sort((a, b) => string.Compare(a.name, b.name, StringComparison.Ordinal));
    + entries.Sort((a, b) => string.Compare(a.startCode, b.startCode, StringComparison.Ordinal));
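To make the difference between the two orderings concrete, here is a minimal sketch with a handful of sample entries. The tuple shape `(Name, Start)` is an assumption for illustration; the tool's real entry type is not shown in this thread. The last line also shows a caveat if start codes were kept as hex strings of varying width: an ordinal string comparison would not agree with numeric order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample entries; the (Name, Start) shape is hypothetical.
var entries = new List<(string Name, int Start)>
{
    ("IsGreek", 0x0370),
    ("IsBasicLatin", 0x0000),
    ("IsCyrillic", 0x0400),
    ("IsArmenian", 0x0530),
};

// Ordering by name, which is what the code under review does.
var byName = entries.OrderBy(e => e.Name, StringComparer.Ordinal).Select(e => e.Name).ToList();

// Ordering by numeric start code, which is what the comment describes.
var byStart = entries.OrderBy(e => e.Start).Select(e => e.Name).ToList();

Console.WriteLine(string.Join(", ", byName));   // IsArmenian, IsBasicLatin, IsCyrillic, IsGreek
Console.WriteLine(string.Join(", ", byStart));  // IsBasicLatin, IsGreek, IsCyrillic, IsArmenian

// Hex strings of different widths do not sort numerically under ordinal comparison.
Console.WriteLine(string.CompareOrdinal("10000", "FFF0") < 0);  // True
```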
Add back 6 blocks that were removed to avoid breaking changes:

- IsCombiningMarksforSymbols (alias for IsCombiningDiacriticalMarksforSymbols)
- IsHighPrivateUseSurrogates
- IsHighSurrogates
- IsLowSurrogates
- IsPrivateUse (alias for IsPrivateUseArea)
- IsPrivateUseArea

Total blocks now: 167 (108 original + 59 new)
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Add back test coverage for the 6 restored blocks:

- IsHighSurrogates
- IsHighPrivateUseSurrogates
- IsLowSurrogates
- IsPrivateUse

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
- Move IsHebrew to correct alphabetical position (before IsHighPrivateUseSurrogates)
- Restore IsSpecials to original value "\uFFF0" (was incorrectly changed to "\uFFF0\u10000")

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
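One detail that may help when reading the IsSpecials fix: in C#, a `\u` escape consumes exactly four hex digits, so the literal "\uFFF0\u10000" cannot express U+10000. The sketch below only demonstrates that language rule; it makes no claim about how `RegexCharClass` interprets the string internally.

```csharp
using System;

string restored = "\uFFF0";
string changed = "\uFFF0\u10000";

Console.WriteLine(restored.Length);            // 1
Console.WriteLine(changed.Length);             // 3

// The second escape is read as U+1000, followed by the literal character '0'.
Console.WriteLine((int)changed[1] == 0x1000);  // True
Console.WriteLine(changed[2]);                 // 0
```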
@copilot, can you update this for Unicode 17.0?
Summary
Successfully updated Regex named blocks from Unicode 4.0 to Unicode 16.0.
The changes enable .NET developers to use regular expressions with up-to-date Unicode block names while maintaining complete backward compatibility.
Original prompt
Fixes #118062