feat(regex_parser): Implement RegExp parser (#3824)
Part of #1164

## Progress updates 🗞️

Waiting for review and advice, while thinking about how to handle escaped strings when the pattern comes from `new RegExp(pat)`.
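To illustrate why escaped strings need care, here is a minimal sketch, assuming the concern is that the pattern seen by the parser differs from the string literal's source text. The helper `cook_string_body` is hypothetical and not part of this PR; it resolves only `\\`, `\"` and `\'`, and records where each resulting character started in the raw text so that positions could be mapped back.

```rust
/// Hypothetical helper (not from this PR): turns the raw body of a JS string
/// literal into the pattern text a RegExp parser would see.
/// Simplification: only `\\`, `\"` and `\'` are resolved; every other
/// backslash is kept as-is.
/// Returns the cooked text plus, for each cooked char, its byte offset in `raw`.
pub fn cook_string_body(raw: &str) -> (String, Vec<usize>) {
    let mut cooked = String::new();
    let mut offsets = Vec::new();
    let mut chars = raw.char_indices().peekable();

    while let Some((start, ch)) = chars.next() {
        let mut resolved = ch;
        if ch == '\\' {
            if let Some(&(_, next)) = chars.peek() {
                if matches!(next, '\\' | '"' | '\'') {
                    // Two source characters collapse into one pattern character.
                    resolved = next;
                    chars.next();
                }
            }
        }
        cooked.push(resolved);
        offsets.push(start);
    }

    (cooked, offsets)
}
```

For `new RegExp("\\d+")` the raw body is four characters long but the pattern is three (`\d+`), so any offset computed on the pattern has to go through a table like `offsets` before it can point into the source.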

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both Unicode (u32) and UTF-16 (u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
  - [x] Keep `enum` size small by `Box<'a, T>`
  - [x] Rework AST shapes
- [x] Split body and flags while validating the literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
	- [x] Parse `BoundaryAssertion`
	- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
	- [x] Parse `Atom` > `PatternCharacter`
	- [x] Parse `Atom` > `.`
	- [x] Parse `Atom` > `\AtomEscape`
		- [x] Parse `\AtomEscape` > `DecimalEscape`
		- [x] Parse `\AtomEscape` > `CharacterClassEscape`
			- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
			- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
		- [x] Parse `\AtomEscape` > `CharacterEscape`
			- [x] Parse `CharacterEscape` > `ControlEscape`
			- [x] Parse `CharacterEscape` > `c AsciiLetter`
			- [x] Parse `CharacterEscape` > `0`
			- [x] Parse `CharacterEscape` > `HexEscapeSequence`
			- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
			- [x] Parse `CharacterEscape` > `IdentityEscape`
		- [x] Parse `\AtomEscape` > `kGroupName`
	- [x] Parse `Atom` > `[CharacterClass]`
		- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
		- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
			- [x] Parse `ClassSetExpression` > `ClassUnion`
			- [x] Parse `ClassSetExpression` > `ClassIntersection`
			- [x] Parse `ClassSetExpression` > `ClassSubtraction`
			- [x] Parse `ClassSetExpression` > `ClassSetOperand`
			- [x] Parse `ClassSetExpression` > `ClassSetRange`
			- [x] Parse `ClassSetExpression` > `ClassSetCharacter`
	- [x] Parse `Atom` > `(GroupSpecifier)`
	- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
	- [x] Parse `QuantifiableAssertion`
	- [x] Parse `ExtendedAtom`
		- [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
		- [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
		- [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
		- [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
	- [x] Pattern :: Disjunction(1/2)
	- [x] Pattern :: Disjunction(2/2)
	- [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
	- [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
	- [x] AtomEscape :: k GroupName
	- [x] AtomEscape :: DecimalEscape
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
	- [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
	- [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
	- [x] CharacterClass :: [^ ClassContents ]
	- [x] NestedClass :: [^ ClassContents ]
	- [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
	- [x] `Reader#peek()` should avoid `iter.next()` equivalent
	- [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
	- [x] ~~Try 1(+1) loop parsing for capturing groups?~~
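
The `Reader` items above (Unicode vs UTF-16 units, and a `peek()` that avoids an `iter.next()` equivalent) can be sketched roughly as follows. This is not the code from this PR: the field names and the `new`, `advance`, `eat` and `offset` methods are assumptions for illustration. The idea shown is to decode the source once into `u32` units and make `peek()` a plain index lookup.

```rust
/// Illustrative reader (hypothetical, not the PR's implementation).
/// In Unicode mode each unit is a code point; otherwise each unit is a
/// UTF-16 code unit widened to u32, so surrogate pairs stay split.
pub struct Reader {
    units: Vec<u32>,
    index: usize,
}

impl Reader {
    pub fn new(source: &str, unicode_mode: bool) -> Self {
        let units = if unicode_mode {
            source.chars().map(|c| c as u32).collect()
        } else {
            source.encode_utf16().map(u32::from).collect()
        };
        Self { units, index: 0 }
    }

    /// Looks at the current unit by index; no iterator is cloned or advanced.
    pub fn peek(&self) -> Option<u32> {
        self.units.get(self.index).copied()
    }

    /// Moves to the next unit, stopping at the end of input.
    pub fn advance(&mut self) {
        if self.index < self.units.len() {
            self.index += 1;
        }
    }

    /// Consumes the current unit only if it equals `ch`.
    pub fn eat(&mut self, ch: char) -> bool {
        if self.peek() == Some(ch as u32) {
            self.advance();
            true
        } else {
            false
        }
    }

    /// Current position, counted in units (not bytes).
    pub fn offset(&self) -> usize {
        self.index
    }
}
```

With this shape, `a😀` is two units in Unicode mode and three units otherwise, which is the distinction the parser needs when the `u` or `v` flag is absent.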

## Follow up

- [x] @Boshen Test suite > #4242
  - [x] Investigate CI errors...
- Next...
  - Support ES2025 Duplicate named capturing groups?
  - Support ES20XX Stage3 Modifiers?
leaysgur committed Aug 20, 2024
1 parent f88970b commit 368364d
Showing 26 changed files with 3,893 additions and 435 deletions.
5 changes: 4 additions & 1 deletion .typos.toml
@@ -23,7 +23,10 @@ extend-exclude = [
 ]
 
 [default]
-extend-ignore-re = ["(?Rm)^.*(#|//)\\s*spellchecker:disable-line$"]
+extend-ignore-re = [
+"(?Rm)^.*(#|//)\\s*spellchecker:disable-line$",
+"(?s)(#|//)\\s*spellchecker:off.*?\\n\\s*(#|//)\\s*spellchecker:on",
+]
 
 [default.extend-words]
 trivias = "trivias"
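
As a usage sketch for the two `extend-ignore-re` patterns in this hunk: the first ignores a single line that ends with a `spellchecker:disable-line` comment, and the second ignores everything from a `spellchecker:off` comment through a later `spellchecker:on` comment on its own line. The functions and the misspelled words below are made up for illustration, and the sketch assumes `typos` skips any text matched by these patterns.

```rust
/// Hypothetical fixture: the typo sits between the off/on markers,
/// so the block pattern would cover it.
pub fn misspelled_fixture() -> &'static str {
    // spellchecker:off
    let word = "recieve";
    // spellchecker:on
    word
}

/// Hypothetical fixture: the typo shares a line with the disable-line
/// marker, so the single-line pattern would cover it.
pub fn misspelled_single_line() -> &'static str {
    "teh" // spellchecker:disable-line
}
```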
20 changes: 12 additions & 8 deletions Cargo.lock


24 changes: 0 additions & 24 deletions crates/oxc_js_regex/Cargo.toml

This file was deleted.

5 changes: 0 additions & 5 deletions crates/oxc_js_regex/README.md

This file was deleted.
