feat(regex_parser): Implement RegExp parser #3824
CodSpeed Performance Report: Merging #3824 will not alter performance.
Now that we have some of the implementation working, we should think about how to support the regex ESLint rules 🤔
@leaysgur Hello, I'm currently working on a smaller version of regex groups; maybe you'll find some useful snippets here: interesting method:
This is awesome, I'm looking forward to this PR 😍 I always had a theory that there are only 5 people on StackOverflow who write all the regex examples and everyone else just copies them into production. If that theory is correct, I bet you'd be the 6th after this 😆
That's the truth. 😅
I hope it becomes an independent crate.
@Boshen ^ What do you think? (Maybe also related to #4242 (comment).)
Hello @leaysgur for See:
Did you consider this use case for escaped backslashes?
@Sysix Thanks for your comment!
The current AST for backref already holds
Yes, but as a RegExp parser, I do not specifically address backslash escaping (rather, escape sequences). For now, the treatment of pattern My understanding may be wrong, and I'm not sure how the OXC parser handles these escapes. 😅
Nope…
For this reason, we may need to add a new flag and implement a lexer layer to check Or just leave it to userland to be called
I'm beginning to think about this. 🤔
Hmmm, not so sure. I think I'll wait for @Boshen's advice.
Here is a summary of what I need to ask:
This is art.
Part of #1164

## Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`.

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
- [x] Keep `enum` size small by `Box<'a, T>`
- [x] Rework AST shapes
- [x] Split body and flags w/ validating literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
- [x] Parse `BoundaryAssertion`
- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
- [x] Parse `Atom` > `PatternCharacter`
- [x] Parse `Atom` > `.`
- [x] Parse `Atom` > `\AtomEscape`
- [x] Parse `\AtomEscape` > `DecimalEscape`
- [x] Parse `\AtomEscape` > `CharacterClassEscape`
- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
- [x] Parse `\AtomEscape` > `CharacterEscape`
- [x] Parse `CharacterEscape` > `ControlEscape`
- [x] Parse `CharacterEscape` > `c AsciiLetter`
- [x] Parse `CharacterEscape` > `0`
- [x] Parse `CharacterEscape` > `HexEscapeSequence`
- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
- [x] Parse `CharacterEscape` > `IdentityEscape`
- [x] Parse `\AtomEscape` > `kGroupName`
- [x] Parse `Atom` > `[CharacterClass]`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
- [x] Parse `ClassSetExpression` > `ClassUnion`
- [x] Parse `ClassSetExpression` > `ClassIntersection`
- [x] Parse `ClassSetExpression` > `ClassSubtraction`
- [x] Parse `ClassSetExpression` > `ClassSetOperand`
- [x] Parse `ClassSetExpression` > `ClassSetRange`
- [x] Parse `ClassSetExpression` > `ClassSetCharacter`
- [x] Parse `Atom` > `(GroupSpecifier)`
- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
  - [x] Parse `QuantifiableAssertion`
  - [x] Parse `ExtendedAtom`
  - [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
  - [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
  - [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
  - [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
  - [x] Pattern :: Disjunction(1/2)
  - [x] Pattern :: Disjunction(2/2)
  - [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
  - [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
  - [x] AtomEscape :: k GroupName
  - [x] AtomEscape :: DecimalEscape
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
  - [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
  - [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  - [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
  - [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
  - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
  - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
  - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
  - [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
  - [x] CharacterClass :: [^ ClassContents ]
  - [x] NestedClass :: [^ ClassContents ]
  - [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
  - [x] `Reader#peek()` should avoid `iter.next()` equivalent
  - [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
  - [x] ~~Try 1(+1) loop parsing for capturing groups?~~

## Follow up

- [x] @Boshen Test suite > #4242
- [x] Investigate CI errors...
- Next...
  - Support ES2025 Duplicate named capturing groups?
  - Support ES20XX Stage3 Modifiers?
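
For readers skimming the TODO list, the two `Reader` items (handling both unicode(u32) and utf-16(u16) units, and a `peek()` that avoids an `iter.next()` equivalent) could look roughly like the sketch below. This is not the code from this PR: `UnitReader` and its methods are hypothetical names, and it assumes every unit is simply stored as a `u32` so that peeking is a plain index lookup.

```rust
/// Hypothetical reader, for illustration only (not the PR's `Reader`).
/// In unicode mode each unit is a full code point; otherwise each unit
/// is a UTF-16 code unit, so astral characters become surrogate pairs.
struct UnitReader {
    units: Vec<u32>,
    index: usize,
}

impl UnitReader {
    fn new(source: &str, unicode_mode: bool) -> Self {
        let units = if unicode_mode {
            source.chars().map(|c| c as u32).collect()
        } else {
            source.encode_utf16().map(u32::from).collect()
        };
        Self { units, index: 0 }
    }

    /// Look at the current unit without consuming it.
    /// This is an index lookup, so no iterator is cloned or advanced.
    fn peek(&self) -> Option<u32> {
        self.units.get(self.index).copied()
    }

    /// Consume and return the current unit.
    fn advance(&mut self) -> Option<u32> {
        let unit = self.peek()?;
        self.index += 1;
        Some(unit)
    }

    /// Consume the current unit only if it equals `ch`.
    fn eat(&mut self, ch: char) -> bool {
        if self.peek() == Some(ch as u32) {
            self.index += 1;
            true
        } else {
            false
        }
    }
}
```

With this shape, `"a😀"` is two units in unicode mode and three units (one ASCII unit plus a surrogate pair) otherwise, which is the distinction the TODO item describes.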
@Sysix Sorry to bother you on an already closed PR. I finally found that we do not need to care about the escaped backslash issue you mentioned. Please see But you may still need to wait a little longer to use this in the linter. #1164 (comment)
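
As a side note on the "Split body and flags w/ validating literal" TODO item, here is a minimal sketch of what that step might involve. It is an assumption-laden illustration, not the PR's implementation: `split_literal` is a hypothetical name, errors are plain strings rather than `OxcDiagnostic`, and nested classes under the `v` flag are not handled.

```rust
/// Hypothetical helper: split a `/body/flags` literal into `(body, flags)`.
/// A `/` closes the body only when it is neither escaped with `\`
/// nor inside a `[...]` character class.
fn split_literal(source: &str) -> Result<(&str, &str), String> {
    let mut chars = source.char_indices();
    match chars.next() {
        Some((_, '/')) => {}
        _ => return Err("literal must start with `/`".to_string()),
    }

    let mut in_escape = false;
    let mut in_class = false;
    for (idx, ch) in chars {
        if in_escape {
            // The previous character was `\`, so this one is taken literally.
            in_escape = false;
            continue;
        }
        match ch {
            '\\' => in_escape = true,
            '[' => in_class = true,
            ']' => in_class = false,
            '/' if !in_class => {
                let body = &source[1..idx];
                if body.is_empty() {
                    return Err("literal body must not be empty".to_string());
                }
                // Everything after the closing `/` is treated as flags.
                return Ok((body, &source[idx + 1..]));
            }
            _ => {}
        }
    }
    Err("unterminated literal".to_string())
}
```

The flags slice is returned unvalidated here; checking for unknown or duplicated flags would be a separate step.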