Use bit mask in Regex's IsWordChar #84003

stephentoub · 2023-03-28T00:46:56Z

Improves throughput of IsWordChar on non-ASCII chars by ~15%.

Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.

ghost · 2023-03-28T00:47:11Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Improves throughput of IsWordChar on non-ASCII chars by ~15%.

Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.

Author:	stephentoub
Assignees:	stephentoub
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

IDisposable

There are three blocks of identical code (not counting the source emitting version nor the test) for the WordCategories constant. Could we pull this out somewhere common so we don't eventually get them out of sync?

stephentoub · 2023-03-28T10:22:27Z

Could we pull this out somewhere common so we don't eventually get them out of sync?

There are three copies of the const. We could remove one of the three by combining two of them into a field const, but the one in the source generator needs to remain separate. The categories that make up \w are also never going to change.

IDisposable · 2023-03-28T18:45:31Z

Working on the coalescing of the constant...

Also I noticed a what looks like a typo

 int chDiv8 = ch >> 3;
 return (uint)chDiv8 < (uint)ascii.Length ?
     (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
     ((WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0 ||
      (ch == ZeroWidthJoiner | ch == ZeroWidthNonJoiner));

That last line shoudn't the | ch == actually be || ch == so we're doing a logical OR not a bitwise OR?

IDisposable · 2023-03-28T19:08:15Z

@stephentoub is there a performance history that explains the duplicate code in IsWordChar(char ch) and IsBoundaryWordChar(char ch)? Specifically, is there any reason we should not just make IsBoundaryWordChar call into IsWordChar like

  public static bool IsBoundaryWordChar(char ch)
  {
      // According to UTS#18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/)
      // RL 1.4 Simple Word Boundaries  The class of <word_character> includes all Alphabetic
      // values from the Unicode character database, from UnicodeData.txt [UData], plus the U+200C
      // ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER.
      const char ZeroWidthNonJoiner = '\u200C', ZeroWidthJoiner = '\u200D';
      return IsWordChar(ch) || ch == ZeroWidthJoiner || ch == ZeroWidthNonJoiner;
  }

For branch prediction and common data, I would expect the ordering shown above to be more valuable than doing the cheaper, but less likely character equality checks first...

stephentoub · 2023-03-28T19:10:22Z

Specifically, is there any reason we should not just make IsBoundaryWordChar call into IsWordChar like

Because then you'll be doing the extra boundary checks even in the ASCII case where they're unnecessary.

stephentoub · 2023-03-28T19:11:31Z

That last line shoudn't the | ch == actually be || ch == so we're doing a logical OR not a bitwise OR?

That is on purpose based on perf testing and the elimination of the branch on average being better.

IDisposable · 2023-03-28T19:14:08Z

Glad I asked! I've got the PR ready then... one other question. Would it be ugly to expose this constant in some way so the Emitter generated code could just reference it? Is there precedent for that sort of construct?

stephentoub · 2023-03-28T19:19:37Z

Would it be ugly to expose this constant in some way so the Emitter generated code could just reference it?

Yes. We want the emitted code to be as readable as possible, which means it's better for it to read:

           const int WordCategories =
               1 << (int)UnicodeCategory.UppercaseLetter |
               1 << (int)UnicodeCategory.LowercaseLetter |
               1 << (int)UnicodeCategory.TitlecaseLetter |
               1 << (int)UnicodeCategory.ModifierLetter |
               1 << (int)UnicodeCategory.OtherLetter |
               1 << (int)UnicodeCategory.NonSpacingMark |
               1 << (int)UnicodeCategory.DecimalDigitNumber |
               1 << (int)UnicodeCategory.ConnectorPunctuation;

than to read:

           const int WordCategories = 0x4013F;

EgorBo · 2023-03-28T19:41:16Z

than to read:

Reminds me this

runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

Lines 90 to 92 in c3983a2

    
           internal static bool IsPrimitiveType(this CorElementType et) 
        
               // COR_ELEMENT_TYPE_I1,I2,I4,I8,U1,U2,U4,U8,R4,R8,I,U,CHAR,BOOLEAN 
        
               => ((1 << (int)et) & 0b_0011_0000_0000_0011_1111_1111_1100) != 0;

IDisposable · 2023-03-28T20:06:48Z

PR submitted, I did make a change in Emitter, but ONLY to change the name to WordCategoriesMask to better match the library code.

My follow-up question for the #84003 (comment) is more along the lines of should we expose a public constant from RegexCharClass that is referenced in the Emitter's output code such that the emitted code looks like

    "    return (uint)chDiv8 < (uint)ascii.Length ?",
    "        (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :",
    "        (RegexCharClass.WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;",
    "}",
});

stephentoub · 2023-03-28T20:10:17Z

My follow-up question for the #84003 (comment) is more along the lines of should we expose a public constant from RegexCharClass that is referenced in the Emitter's output code such that the emitted code looks like

No, I don't believe there's enough value in that (and RegexCharClass is internal). This is how we currently do the check; that could change in the future, and we'd be exposing public API for what's an implementation detail.

ghost assigned stephentoub Mar 28, 2023

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Mar 28, 2023

MihaZupan reviewed Mar 28, 2023

View reviewed changes

...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Outdated Show resolved Hide resolved

Use bit mask in Regex's IsWordChar

65bb079

stephentoub changed the title ~~Use Vector128 in Regex's IsWordChar~~ Use bit mask in Regex's IsWordChar Mar 28, 2023

stephentoub force-pushed the iswordchar branch from 78a7499 to 65bb079 Compare March 28, 2023 01:50

MihaZupan approved these changes Mar 28, 2023

View reviewed changes

IDisposable reviewed Mar 28, 2023

View reviewed changes

This was referenced Mar 28, 2023

WasmTestOnBrowser-System.* test failures in CI #83655

Closed

CI failure STRICT_JS doesn't work with MODULARIZE or EXPORT_ES6 #83986

Closed

stephentoub merged commit ed921ee into dotnet:main Mar 28, 2023

stephentoub deleted the iswordchar branch March 28, 2023 10:22

IDisposable mentioned this pull request Mar 28, 2023

Coalesce identical constant WordCategories #84047

Merged

kunalspathak mentioned this pull request Mar 30, 2023

[Perf] Windows/arm64: 3 Improvements on 3/28/2023 10:24:48 AM dotnet/perf-autofiling-issues#14628

Closed

AndyAyersMS mentioned this pull request Mar 30, 2023

[Perf] Linux/arm64: 4 Regressions on 3/28/2023 2:42:21 AM dotnet/perf-autofiling-issues#14633

Closed

kunalspathak mentioned this pull request Mar 30, 2023

[Perf] Windows/arm64: 6 Improvements on 3/28/2023 10:24:48 AM dotnet/perf-autofiling-issues#14621

Closed

ghost locked as resolved and limited conversation to collaborators Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use bit mask in Regex's IsWordChar #84003

Use bit mask in Regex's IsWordChar #84003

stephentoub commented Mar 28, 2023

ghost commented Mar 28, 2023

IDisposable left a comment

stephentoub commented Mar 28, 2023

IDisposable commented Mar 28, 2023 •

edited

Loading

IDisposable commented Mar 28, 2023

stephentoub commented Mar 28, 2023

stephentoub commented Mar 28, 2023

IDisposable commented Mar 28, 2023 •

edited

Loading

stephentoub commented Mar 28, 2023

EgorBo commented Mar 28, 2023

IDisposable commented Mar 28, 2023

stephentoub commented Mar 28, 2023 •

edited

Loading

Use bit mask in Regex's IsWordChar #84003

Use bit mask in Regex's IsWordChar #84003

Conversation

stephentoub commented Mar 28, 2023

ghost commented Mar 28, 2023

IDisposable left a comment

Choose a reason for hiding this comment

stephentoub commented Mar 28, 2023

IDisposable commented Mar 28, 2023 • edited Loading

IDisposable commented Mar 28, 2023

stephentoub commented Mar 28, 2023

stephentoub commented Mar 28, 2023

IDisposable commented Mar 28, 2023 • edited Loading

stephentoub commented Mar 28, 2023

EgorBo commented Mar 28, 2023

IDisposable commented Mar 28, 2023

stephentoub commented Mar 28, 2023 • edited Loading

IDisposable commented Mar 28, 2023 •

edited

Loading

IDisposable commented Mar 28, 2023 •

edited

Loading

stephentoub commented Mar 28, 2023 •

edited

Loading