Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bit mask in Regex's IsWordChar #84003

Merged
merged 1 commit into from
Mar 28, 2023
Merged

Conversation

stephentoub
Copy link
Member

Improves throughput of IsWordChar on non-ASCII chars by ~15%.

Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.

@ghost
Copy link

ghost commented Mar 28, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Improves throughput of IsWordChar on non-ASCII chars by ~15%.

Roslyn has also fixed the codegen for is/is not patterns, so the switch expressions used here instead have been switched back.

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub stephentoub changed the title Use Vector128 in Regex's IsWordChar Use bit mask in Regex's IsWordChar Mar 28, 2023
Copy link
Contributor

@IDisposable IDisposable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three blocks of identical code (not counting the source emitting version nor the test) for the WordCategories constant. Could we pull this out somewhere common so we don't eventually get them out of sync?

@stephentoub
Copy link
Member Author

Could we pull this out somewhere common so we don't eventually get them out of sync?

There are three copies of the const. We could remove one of the three by combining two of them into a field const, but the one in the source generator needs to remain separate. The categories that make up \w are also never going to change.

@stephentoub stephentoub merged commit ed921ee into dotnet:main Mar 28, 2023
@stephentoub stephentoub deleted the iswordchar branch March 28, 2023 10:22
@IDisposable
Copy link
Contributor

IDisposable commented Mar 28, 2023

Working on the coalescing of the constant...

Also I noticed a what looks like a typo

 int chDiv8 = ch >> 3;
 return (uint)chDiv8 < (uint)ascii.Length ?
     (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :
     ((WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0 ||
      (ch == ZeroWidthJoiner | ch == ZeroWidthNonJoiner));   

That last line shoudn't the | ch == actually be || ch == so we're doing a logical OR not a bitwise OR?

@IDisposable
Copy link
Contributor

@stephentoub is there a performance history that explains the duplicate code in IsWordChar(char ch) and IsBoundaryWordChar(char ch)? Specifically, is there any reason we should not just make IsBoundaryWordChar call into IsWordChar like

  public static bool IsBoundaryWordChar(char ch)
  {
      // According to UTS#18 Unicode Regular Expressions (http://www.unicode.org/reports/tr18/)
      // RL 1.4 Simple Word Boundaries  The class of <word_character> includes all Alphabetic
      // values from the Unicode character database, from UnicodeData.txt [UData], plus the U+200C
      // ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER.
      const char ZeroWidthNonJoiner = '\u200C', ZeroWidthJoiner = '\u200D';
      return IsWordChar(ch) || ch == ZeroWidthJoiner || ch == ZeroWidthNonJoiner;
  }

For branch prediction and common data, I would expect the ordering shown above to be more valuable than doing the cheaper, but less likely character equality checks first...

@stephentoub
Copy link
Member Author

Specifically, is there any reason we should not just make IsBoundaryWordChar call into IsWordChar like

Because then you'll be doing the extra boundary checks even in the ASCII case where they're unnecessary.

@stephentoub
Copy link
Member Author

That last line shoudn't the | ch == actually be || ch == so we're doing a logical OR not a bitwise OR?

That is on purpose based on perf testing and the elimination of the branch on average being better.

@IDisposable
Copy link
Contributor

IDisposable commented Mar 28, 2023

Glad I asked! I've got the PR ready then... one other question. Would it be ugly to expose this constant in some way so the Emitter generated code could just reference it? Is there precedent for that sort of construct?

@stephentoub
Copy link
Member Author

Would it be ugly to expose this constant in some way so the Emitter generated code could just reference it?

Yes. We want the emitted code to be as readable as possible, which means it's better for it to read:

           const int WordCategories =
               1 << (int)UnicodeCategory.UppercaseLetter |
               1 << (int)UnicodeCategory.LowercaseLetter |
               1 << (int)UnicodeCategory.TitlecaseLetter |
               1 << (int)UnicodeCategory.ModifierLetter |
               1 << (int)UnicodeCategory.OtherLetter |
               1 << (int)UnicodeCategory.NonSpacingMark |
               1 << (int)UnicodeCategory.DecimalDigitNumber |
               1 << (int)UnicodeCategory.ConnectorPunctuation;

than to read:

           const int WordCategories = 0x4013F;

@EgorBo
Copy link
Member

EgorBo commented Mar 28, 2023

than to read:

Reminds me this

internal static bool IsPrimitiveType(this CorElementType et)
// COR_ELEMENT_TYPE_I1,I2,I4,I8,U1,U2,U4,U8,R4,R8,I,U,CHAR,BOOLEAN
=> ((1 << (int)et) & 0b_0011_0000_0000_0011_1111_1111_1100) != 0;

@IDisposable
Copy link
Contributor

PR submitted, I did make a change in Emitter, but ONLY to change the name to WordCategoriesMask to better match the library code.

My follow-up question for the #84003 (comment) is more along the lines of should we expose a public constant from RegexCharClass that is referenced in the Emitter's output code such that the emitted code looks like

    "    return (uint)chDiv8 < (uint)ascii.Length ?",
    "        (ascii[chDiv8] & (1 << (ch & 0x7))) != 0 :",
    "        (RegexCharClass.WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;",
    "}",
});

@stephentoub
Copy link
Member Author

stephentoub commented Mar 28, 2023

My follow-up question for the #84003 (comment) is more along the lines of should we expose a public constant from RegexCharClass that is referenced in the Emitter's output code such that the emitted code looks like

No, I don't believe there's enough value in that (and RegexCharClass is internal). This is how we currently do the check; that could change in the future, and we'd be exposing public API for what's an implementation detail.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants