diff --git a/HACKING b/HACKING index 509808394..488ed721f 100644 --- a/HACKING +++ b/HACKING @@ -199,9 +199,11 @@ META_RANGE_ESCAPED hyphen in class range with at least one escape META_RANGE_LITERAL hyphen in class range defined literally META_SKIP (*SKIP) - no argument (see below for with argument) META_THEN (*THEN) - no argument (see below for with argument) -META_ECLASS_OR || in an extended character class -META_ECLASS_AND && in an extended character class -META_ECLASS_SUB -- in an extended character class +META_ECLASS_AND && (or &) in an extended character class +META_ECLASS_OR || (or |, +) in an extended character class +META_ECLASS_SUB -- (or -) in an extended character class +META_ECLASS_XOR ~~ (or ^) in an extended character class +META_ECLASS_NOT ! in an extended character class The two RANGE values occur only in character classes. They are positioned between two literals that define the start and end of the range. In an EBCDIC diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 747f88ebd..b41ff93f1 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -22,31 +22,32 @@
@@ -1555,38 +1556,82 @@
-The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])" -syntax, allowing instead extended class behaviour inside ordinary "[...]" -character classes. This altered syntax for "[...]" classes is loosely described -by the Unicode standard UTS#18. +PCRE2 supports Perl's (?[...]) extended character class syntax. This can +be used to perform set operations, such intersection.
-Firstly, in Perl syntax, an expression such as "[a[]" is a character class -with two literal characters "a" and "[", but in UTS#18 extended classes the "[" -character becomes an additional metacharacter within classes, denoting the start -of a nested class, so a literal "[" must be escaped as "\[". +The syntax permitted within (?[...]) is quite different to ordinary character +classes. Inside the extended class, there is an expression syntax consisting of +"atoms", operators, and ordinary parentheses "()" used for grouping.
-Secondly, within the UTS#18 extended syntax, there are additional operators -"||", "&&" and "--" which denote character class union, intersection, and -subtraction respectively. In standard Perl syntax, these would simply be -needlessly-repeated literals (except for "-" which can denote a range). These -operators can be used in constructs such as "[\p{L}--[QW]]" for "Unicode -letters, other than Q and W". A literal "-" at the end of a range must be -escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it -must be escaped as "[\--1]" in UTS#18 extended classes). +The allowed atoms are any escaped characters or sets such as \n or \d, POSIX +classes such as [:alpha:], and any ordinary character class may be nested as an +atom within an extended class. For example, in (?[\d & [...]]) the nested +ordinary class [...] follows the ordinary rules for character classes, in which +parentheses are not metacharacters, and character literals and ranges are +permitted. +
++However, when outside a nested ordinary character class, such as in +(?[... + (...)]), character literals and ranges may not be used, as they are not +atoms in the extended syntax. The extended syntax does not introduce any +additional escape sequences, so (?[\y]) is an unknown escape, as it would be +inside [\y]. +
++In the extended syntax, ^ does not negate a class (except within an +ordinary class nested inside an extended class); it is instead a binary +operator. +
++The binary operators are "&" (intersection), "|" or "+" (union), "-" +(subtraction) and "^" (symmetric difference). These are left-associative and +"&" has higher (tighter) precedence, while the others have equal lower +precedence. The one prefix unary operator is "!" (complement), with highest +precedence.
-The specific rules in PCRE2 are that classes can be nested: -"[...[B]...[^C]...]". The individual class items (literal characters, literal +A Perl extended character class always has the /xx modifier turned on within +it. +
++The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...]) +syntax, allowing instead extended class behaviour inside ordinary [...] +character classes. This altered syntax for [...] classes is loosely described +by the Unicode standard UTS#18. (The PCRE2_ALT_EXTENDED_CLASS option does not +prevent use of (?[...]) classes, but only changes the meaning of [...] +classes.) +
++Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a +character class with two literal characters "a" and "[", but in UTS#18 extended +classes the "[" character becomes an additional metacharacter within classes, +denoting the start of a nested class, so a literal "[" must be escaped as "\[". +
++Secondly, within the UTS#18 extended syntax, there are additional operators +"||", "&&", "--" and "~~" which denote character class union, intersection, +subtraction, and symmetric difference respectively. In standard Perl syntax, +these would simply be needlessly-repeated literals (except for "-" which can +denote a range). These operators can be used in constructs such as +[\p{L}--[QW]] for "Unicode letters, other than Q and W". A literal "-" at +the end of a range must be escaped (so while "[--1]" in Perl syntax is the +range from hyphen to "1", it must be escaped as "[\--1]" in UTS#18 extended +classes). +
++The specific rules in PCRE2 are: Classes can be nested, for example +[...[B]...[^C]...]. The individual class items (literal characters, literal ranges, properties such as \d or \p{...}, and nested classes) can be -combined by juxtaposition or by an operator "||", "&&", or "--". +combined by juxtaposition or by an operator "||", "&&", "--", or "~~". Juxtaposition is the implicit union operator, and binds more tightly than any explicit operator. Precedence between the explicit operators is not defined, -so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but -"[A&&[B--C]]" is accepted). +so mixing operators is a syntax error (thus [A&&B--C] is an error, but +[A&&[B--C]] is accepted).
This is an emerging syntax which is being adopted gradually across the regex @@ -1595,8 +1640,8 @@
PCRE2's syntax adds syntax restrictions similar to ECMASCript's /v flag, so @@ -1605,7 +1650,7 @@
Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports @@ -1709,7 +1754,7 @@
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of @@ -1730,7 +1775,7 @@
Vertical bar characters are used to separate alternative patterns. For example, the pattern @@ -1745,7 +1790,7 @@
The settings of several options can be changed within a pattern by a sequence of letters enclosed between "(?" and ")". The following are Perl-compatible, @@ -1843,7 +1888,7 @@
Groups are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a group does two things: @@ -1899,7 +1944,7 @@
Perl 5.10 introduced a feature whereby each alternative in a group uses the same numbers for its capturing parentheses. Such a group starts with (?| and is @@ -1945,7 +1990,7 @@
Identifying capture groups by number is simple, but it can be very hard to keep track of the numbers in complicated patterns. Furthermore, if an expression is @@ -2065,7 +2110,7 @@
Repetition is specified by quantifiers, which may follow any one of these items: @@ -2247,7 +2292,7 @@
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be @@ -2358,7 +2403,7 @@
Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a backreference to a capture group earlier (that @@ -2496,7 +2541,7 @@
An assertion is a test that does not consume any characters. The test must succeed for the match to continue. The simple assertions coded as \b, \B, @@ -2746,7 +2791,7 @@
Traditional lookaround assertions are atomic. That is, if an assertion is true, but there is a subsequent matching failure, there is no backtracking into the @@ -2806,7 +2851,7 @@
A special kind of assertion, not compatible with Perl, makes it possible to check the contents of a captured substring by matching it with a subpattern. @@ -2865,7 +2910,7 @@
In concept, a script run is a sequence of characters that are all from the same Unicode script such as Latin or Greek. However, because some scripts are @@ -2927,7 +2972,7 @@
It is possible to cause the matching process to obey a pattern fragment conditionally or to choose between two alternative fragments, depending on @@ -3128,7 +3173,7 @@
There are two ways of including comments in patterns that are processed by PCRE2. In both cases, the start of the comment must not be in a character @@ -3158,7 +3203,7 @@
Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can @@ -3346,7 +3391,7 @@
If the syntax for a recursive group call (either by number or by name) is used outside the parentheses to which it refers, it operates a bit like a subroutine @@ -3394,7 +3439,7 @@
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative @@ -3412,7 +3457,7 @@
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -3490,7 +3535,7 @@
There are a number of special "Backtracking Control Verbs" (to use Perl's terminology) that modify the behaviour of backtracking during matching. They @@ -4017,12 +4062,12 @@
pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3).
-
Philip Hazel
@@ -4031,9 +4076,9 @@
-Last updated: 21 September 2024
+Last updated: 08 November 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index 91386ad97..a6d89b94e 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -24,29 +24,30 @@
@@ -311,7 +312,45 @@
+When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be +used, allowing nested character classes, combined using set operators. +
+ [x&&[^y]] UTS#18 extended character class + + x||y set union (OR) + x&&y set intersection (AND) + x--y set difference (AND NOT) + x~~y set symmetric difference (XOR) + ++ +
+
+ (?[...]) Perl extended character class + (?[\p{Thai} & \p{Nd}]) operators; whitespace ignored + (?[(x - y) & z]) parentheses for grouping + + (?[ [^3] & \p{Nd} ]) [...] is a nested ordinary class + (?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...] + (?[ \d - [3] ]) backslash-escaped set is allowed outside [...] + (?[ !\n & [:ascii:] ]) backslash-escaped character is allowed outside [...] + all other characters or ranges must be enclosed in [...] + + x|y, x+y set union (OR) + x&y set intersection (AND) + x-y set difference (AND NOT) + x^y set symmetric difference (XOR) + !x set complement (NOT) ++Inside a Perl extended character class, [...] switches mode to be interpreted +as an ordinary character class. Outside of a nested [...], the only items +permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside +a nested ordinary class, ^ has its usual meaning (inverts the class when used +as the first character); outside of a nested class, ^ is the XOR operator. + +
? 0 or 1, greedy @@ -335,7 +374,7 @@-pcre2syntax man page
{,m}? zero up to m, lazy
\b word boundary @@ -353,7 +392,7 @@-pcre2syntax man page
\G first matching position in subject
\K set reported start of match @@ -363,13 +402,13 @@-pcre2syntax man page
option is set, the previous behaviour is re-enabled. When this option is set, \K is honoured in positive assertions, but ignored in negative ones. -
ALTERNATION
+
ALTERNATION
expr|expr|expr...-
CAPTURING
+
CAPTURING
(...) capture group @@ -384,20 +423,20 @@-pcre2syntax man page
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In both cases, a name must not start with a digit. -
ATOMIC GROUPS
+
ATOMIC GROUPS
(?>...) atomic non-capture group (*atomic:...) atomic non-capture group-
COMMENT
+
COMMENT
(?#....) comment (not nestable)-
OPTION SETTING
+
OPTION SETTING
Changes of these options within a group are automatically cancelled at the end of the group. @@ -456,7 +495,7 @@
pcre2syntax man page
application can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. -
NEWLINE CONVENTION
+
NEWLINE CONVENTION
These are recognized only at the very start of the pattern or after option settings with a similar syntax. @@ -469,7 +508,7 @@
pcre2syntax man page
(*NUL) the NUL character (binary zero)
WHAT \R MATCHES
+
WHAT \R MATCHES
These are recognized only at the very start of the pattern or after option setting with a similar syntax. @@ -478,7 +517,7 @@
pcre2syntax man page
(*BSR_UNICODE) any Unicode newline sequence
(?=...) ) @@ -504,7 +543,7 @@pcre2syntax man page
(ultimate default 255). If every branch matches a fixed number of characters, the limit for each branch is 65535 characters. -
NON-ATOMIC LOOKAROUND ASSERTIONS
+
NON-ATOMIC LOOKAROUND ASSERTIONS
These assertions are specific to PCRE2 and are not Perl-compatible.
@@ -517,7 +556,7 @@-pcre2syntax man page
(*non_atomic_positive_lookbehind:...) )
SUBSTRING SCAN ASSERTION
+
SUBSTRING SCAN ASSERTION
This feature is not Perl-compatible.
@@ -534,7 +573,7 @@-pcre2syntax man page
SCRIPT RUNS
+
SCRIPT RUNS
(*script_run:...) ) script run, can be backtracked into @@ -544,7 +583,7 @@-pcre2syntax man page
(*asr:...) )
BACKREFERENCES
+
BACKREFERENCES
\n reference by number (can be ambiguous) @@ -561,7 +600,7 @@-pcre2syntax man page
(?P=name) reference by name (Python)
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
+
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
(?R) recurse whole pattern @@ -580,7 +619,7 @@-pcre2syntax man page
\g'-n' call subroutine by relative number (PCRE2 extension)
CONDITIONAL PATTERNS
+
CONDITIONAL PATTERNS
(?(condition)yes-pattern) @@ -603,7 +642,7 @@pcre2syntax man page
conditions or recursion tests. Such a condition is interpreted as a reference condition if the relevant named group exists. -
BACKTRACKING CONTROL
+
BACKTRACKING CONTROL
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the name is mandatory, for the others it is optional. (*SKIP) changes its behaviour @@ -630,7 +669,7 @@
pcre2syntax man page
The effect of one of these verbs in a group called as a subroutine is confined to the subroutine call. -
CALLOUTS
+
CALLOUTS
(?C) callout (assumed number 0) @@ -641,7 +680,7 @@pcre2syntax man page
start and the end), and the starting delimiter { matched with the ending delimiter }. To encode the ending delimiter within the string, double it. -
REPLACEMENT STRINGS
+
REPLACEMENT STRINGS
If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for pcre2_substitute() is not interpreted. Otherwise, by default, the only @@ -687,12 +726,12 @@
pcre2syntax man page
The substitution strings themselves are expanded. Backslash can be used to escape colons and closing curly brackets. -
SEE ALSO
+
SEE ALSO
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2(3).
-
AUTHOR
+
AUTHOR
Philip Hazel
@@ -701,9 +740,9 @@pcre2syntax man page
Cambridge, England.
-
REVISION
+
REVISION
-Last updated: 20 October 2024 +Last updated: 08 November 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index d17ce1f9b..b61246eb0 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "21 Sepbember 2024" "PCRE2 10.45" +.TH PCRE2PATTERN 3 "08 November 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -1547,35 +1547,74 @@ the next two sections), and the terminating closing square bracket. However, escaping other non-alphanumeric characters does no harm. . . -.SH "UTS#18 EXTENDED CHARACTER CLASSES" +.SH "PERL EXTENDED CHARACTER CLASSES" .rs -The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])" -syntax, allowing instead extended class behaviour inside ordinary "[...]" -character classes. This altered syntax for "[...]" classes is loosely described -by the Unicode standard UTS#18. +PCRE2 supports Perl's (?[...]) extended character class syntax. This can +be used to perform set operations, such intersection. +.P +The syntax permitted within (?[...]) is quite different to ordinary character +classes. Inside the extended class, there is an expression syntax consisting of +"atoms", operators, and ordinary parentheses "()" used for grouping. +.P +The allowed atoms are any escaped characters or sets such as \en or \ed, POSIX +classes such as [:alpha:], and any ordinary character class may be nested as an +atom within an extended class. For example, in (?[\ed & [...]]) the nested +ordinary class [...] follows the ordinary rules for character classes, in which +parentheses are not metacharacters, and character literals and ranges are +permitted. +.P +However, when outside a nested ordinary character class, such as in +(?[... + (...)]), character literals and ranges may not be used, as they are not +atoms in the extended syntax. The extended syntax does not introduce any +additional escape sequences, so (?[\ey]) is an unknown escape, as it would be +inside [\ey]. +.P +In the extended syntax, ^ does not negate a class (except within an +ordinary class nested inside an extended class); it is instead a binary +operator. .P -Firstly, in Perl syntax, an expression such as "[a[]" is a character class -with two literal characters "a" and "[", but in UTS#18 extended classes the "[" -character becomes an additional metacharacter within classes, denoting the start -of a nested class, so a literal "[" must be escaped as "\e[". +The binary operators are "&" (intersection), "|" or "+" (union), "-" +(subtraction) and "^" (symmetric difference). These are left-associative and +"&" has higher (tighter) precedence, while the others have equal lower +precedence. The one prefix unary operator is "!" (complement), with highest +precedence. +.P +A Perl extended character class always has the /xx modifier turned on within +it. +. +. +.SH "UTS#18 EXTENDED CHARACTER CLASSES" +.rs +The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...]) +syntax, allowing instead extended class behaviour inside ordinary [...] +character classes. This altered syntax for [...] classes is loosely described +by the Unicode standard UTS#18. (The PCRE2_ALT_EXTENDED_CLASS option does not +prevent use of (?[...]) classes, but only changes the meaning of [...] +classes.) +.P +Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a +character class with two literal characters "a" and "[", but in UTS#18 extended +classes the "[" character becomes an additional metacharacter within classes, +denoting the start of a nested class, so a literal "[" must be escaped as "\e[". .P Secondly, within the UTS#18 extended syntax, there are additional operators -"||", "&&" and "--" which denote character class union, intersection, and -subtraction respectively. In standard Perl syntax, these would simply be -needlessly-repeated literals (except for "-" which can denote a range). These -operators can be used in constructs such as "[\ep{L}--[QW]]" for "Unicode -letters, other than Q and W". A literal "-" at the end of a range must be -escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it -must be escaped as "[\e--1]" in UTS#18 extended classes). -.P -The specific rules in PCRE2 are that classes can be nested: -"[...[B]...[^C]...]". The individual class items (literal characters, literal +"||", "&&", "--" and "~~" which denote character class union, intersection, +subtraction, and symmetric difference respectively. In standard Perl syntax, +these would simply be needlessly-repeated literals (except for "-" which can +denote a range). These operators can be used in constructs such as +[\ep{L}--[QW]] for "Unicode letters, other than Q and W". A literal "-" at +the end of a range must be escaped (so while "[--1]" in Perl syntax is the +range from hyphen to "1", it must be escaped as "[\e--1]" in UTS#18 extended +classes). +.P +The specific rules in PCRE2 are: Classes can be nested, for example +[...[B]...[^C]...]. The individual class items (literal characters, literal ranges, properties such as \ed or \ep{...}, and nested classes) can be -combined by juxtaposition or by an operator "||", "&&", or "--". +combined by juxtaposition or by an operator "||", "&&", "--", or "~~". Juxtaposition is the implicit union operator, and binds more tightly than any explicit operator. Precedence between the explicit operators is not defined, -so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but -"[A&&[B--C]]" is accepted). +so mixing operators is a syntax error (thus [A&&B--C] is an error, but +[A&&[B--C]] is accepted). .P This is an emerging syntax which is being adopted gradually across the regex ecosystem: for example JavaScript adopted the "/v" flag in ECMAScript 2024; @@ -1583,8 +1622,8 @@ Python's "re" module reserves the syntax for future use with a FutureWarning for unescaped use of "[" as a literal within character classes. Due to UTS#18 providing insufficient guidance, engines interpret the syntax differently. Rust's "regex" crate and Python's "regex" PyPi module both implement UTS#18 -extended classes, but with slight incompatibilities ("[A||B&&C]" is parsed as -"[A||[B&&C]]" in Python's "regex" but as "[[A||B]&&C]" in Rust's "regex"). +extended classes, but with slight incompatibilities ([A||B&&C] is parsed as +[A||[B&&C]] in Python's "regex" but as [[A||B]&&C] in Rust's "regex"). .P PCRE2's syntax adds syntax restrictions similar to ECMASCript's /v flag, so that all the extended classes accepted as valid by PCRE2 have the property @@ -4075,6 +4114,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 21 September 2024 +Last updated: 08 November 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index 4e769f1fb..a1a685dc7 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "20 October 2024" "PCRE2 10.45" +.TH PCRE2SYNTAX 3 "08 November 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -284,6 +284,43 @@ The recognized classes are: In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \eQ...\eE inside a character class. +.P +When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be +used, allowing nested character classes, combined using set operators. +.sp + [x&&[^y]] UTS#18 extended character class +.sp + x||y set union (OR) + x&&y set intersection (AND) + x--y set difference (AND NOT) + x~~y set symmetric difference (XOR) +.sp +. +. +.SH "PERL EXTENDED CHARACTER CLASSES" +.rs +.sp + (?[...]) Perl extended character class + (?[\ep{Thai} & \ep{Nd}]) operators; whitespace ignored + (?[(x - y) & z]) parentheses for grouping +.sp + (?[ [^3] & \ep{Nd} ]) [...] is a nested ordinary class + (?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...] + (?[ \ed - [3] ]) backslash-escaped set is allowed outside [...] + (?[ !\en & [:ascii:] ]) backslash-escaped character is allowed outside [...] + all other characters or ranges must be enclosed in [...] +.sp + x|y, x+y set union (OR) + x&y set intersection (AND) + x-y set difference (AND NOT) + x^y set symmetric difference (XOR) + !x set complement (NOT) +.sp +Inside a Perl extended character class, [...] switches mode to be interpreted +as an ordinary character class. Outside of a nested [...], the only items +permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside +a nested ordinary class, ^ has its usual meaning (inverts the class when used +as the first character); outside of a nested class, ^ is the XOR operator. . . .SH "QUANTIFIERS" @@ -692,6 +729,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 20 October 2024 +Last updated: 08 November 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic index b4b461e93..af18d4218 100644 --- a/src/pcre2.h.generic +++ b/src/pcre2.h.generic @@ -339,6 +339,10 @@ pcre2_pattern_convert(). */ #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213 +#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214 +#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2.h.in b/src/pcre2.h.in index 89ccfae56..fff5f2cd1 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -339,6 +339,10 @@ pcre2_pattern_convert(). */ #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213 +#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214 +#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index caf34a66f..b06f716a3 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -260,9 +260,11 @@ static unsigned char meta_extra_lengths[] = { 2, /* META_MINMAX */ 2, /* META_MINMAX_PLUS */ 2, /* META_MINMAX_QUERY */ - 0, /* META_ECLASS_OR */ 0, /* META_ECLASS_AND */ - 0 /* META_ECLASS_SUB */ + 0, /* META_ECLASS_OR */ + 0, /* META_ECLASS_SUB */ + 0, /* META_ECLASS_XOR */ + 0 /* META_ECLASS_NOT */ }; /* Types for skipping parts of a parsed pattern. */ @@ -1077,9 +1079,11 @@ for (;;) fprintf(stderr, ") length=%u", length); break; - case META_ECLASS_OR: fprintf(stderr, "META_ECLASS_OR"); break; case META_ECLASS_AND: fprintf(stderr, "META_ECLASS_AND"); break; + case META_ECLASS_OR: fprintf(stderr, "META_ECLASS_OR"); break; case META_ECLASS_SUB: fprintf(stderr, "META_ECLASS_SUB"); break; + case META_ECLASS_XOR: fprintf(stderr, "META_ECLASS_XOR"); break; + case META_ECLASS_NOT: fprintf(stderr, "META_ECLASS_NOT"); break; } fprintf(stderr, "\n"); } @@ -1273,7 +1277,7 @@ if (allow_sign >= 0 && sign != 0) } if (sign > 0) n += allow_sign; - else if ((int)n > allow_sign) + else if (n > (uint32_t)allow_sign) { *errorcodeptr = ERR15; /* Non-existent subpattern */ goto EXIT; @@ -2826,13 +2830,28 @@ enum { RANGE_STARTED, /* State after '[1-'; last-emitted code is META_RANGE_XYZ */ RANGE_FORBID_NO, /* State after '[\d'; '-]' is allowed but not '-1]' */ RANGE_FORBID_STARTED, /* State after '[\d-'*/ - RANGE_OK_ESCAPED, /* State after '[1'; hyphen may be a range */ - RANGE_OK_LITERAL /* State after '[\1'; hyphen may be a range */ + RANGE_OK_ESCAPED, /* State after '[\1'; hyphen may be a range */ + RANGE_OK_LITERAL /* State after '[1'; hyphen may be a range */ +}; + +/* States used for analyzing operators and operands in extended character +classes. */ + +enum { + CLASS_OP_EMPTY, /* At start of an expression; empty previous contents */ + CLASS_OP_OPERAND, /* Have preceding operand; after "z" a "--" can follow */ + CLASS_OP_OPERATOR /* Have preceding operator; after "--" operand must follow */ }; -/* States used for analyzing operators and operands in character classes. */ +/* States used for determining the parse mode in character classes. The two +PERL_EXT values must be last. */ -enum { CLASS_OP_NONE, CLASS_OP_OPERAND, CLASS_OP_OPERATOR }; +enum { + CLASS_MODE_NORMAL, /* Ordinary PCRE2 '[...]' class. */ + CLASS_MODE_ALT_EXT, /* UTS#18-style extended '[...]' class. */ + CLASS_MODE_PERL_EXT, /* Perl extended '(?[...])' class. */ + CLASS_MODE_PERL_EXT_LEAF /* Leaf within extended '(?[ [...] ])' class. */ +}; /* Only in 32-bit mode can there be literals > META_END. A macro encapsulates the storing of literal values in the main parsed pattern, where they can always @@ -2859,6 +2878,7 @@ uint32_t delimiter; uint32_t namelen; uint32_t class_range_state; uint32_t class_op_state; +uint32_t class_mode_state; uint32_t *class_start; uint32_t *verblengthptr = NULL; /* Value avoids compiler warning */ uint32_t *verbstartptr = NULL; @@ -3543,7 +3563,7 @@ while (ptr < ptrend) if (!prev_okquantifier) { errorcode = ERR9; - goto FAILED_BACK; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 } /* Most (*VERB)s are not allowed to be quantified, but an ungreedy @@ -3578,14 +3598,6 @@ while (ptr < ptrend) /* ---- Character class ---- */ case CHAR_LEFT_SQUARE_BRACKET: - okquantifier = TRUE; - - /* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/536 - We shall support Perl's (?[...]) syntax. We need a variable class_perlext = true - and a goto jumping here if we see "(?[...". We need to check for closing "])" and - also implement the completely idiosyncratic nesting and operator rules in this - mode. We can hopefully emit exactly the same META codes as for the UTS#18 - syntax, so that only parser changes are required for the Perl syntax. */ /* In another (POSIX) regex library, the ugly syntax [[:<:]] and [[:>:]] is used for "start of word" and "end of word". As these are otherwise illegal @@ -3623,6 +3635,7 @@ while (ptr < ptrend) } *parsed_pattern++ = META_KET; ptr += 6; + okquantifier = TRUE; break; } @@ -3637,6 +3650,15 @@ while (ptr < ptrend) goto FAILED; } + class_mode_state = ((options & PCRE2_ALT_EXTENDED_CLASS) != 0)? + CLASS_MODE_ALT_EXT : CLASS_MODE_NORMAL; + + /* Jump here from '(?[...])'. That jump must initialize class_mode_state, + set c to the '[' character, and ptr to just after the '['. */ + + FROM_PERL_EXTENDED_CLASS: + okquantifier = TRUE; + /* In an EBCDIC environment, Perl treats alphabetic ranges specially because there are holes in the encoding, and simply using the range A-Z (for example) would include the characters in the holes. This applies only @@ -3646,14 +3668,14 @@ while (ptr < ptrend) ranges. */ /* Loop for the contents of the class. Classes may be nested, if - PCRE2_ALT_EXTENDED_CLASS is set. */ + PCRE2_ALT_EXTENDED_CLASS is set, or the class is of the form (?[...]). */ /* c is still set to '[' so the loop will handle the start of the class. */ class_depth_m1 = -1; class_maxdepth_m1 = -1; class_range_state = RANGE_NO; - class_op_state = CLASS_OP_NONE; + class_op_state = CLASS_OP_EMPTY; class_start = NULL; for (;;) @@ -3670,13 +3692,26 @@ while (ptr < ptrend) ptr++; /* Skip the 'E' */ goto CLASS_CONTINUE; } + + /* Surprisingly, you cannot use \Q..\E to escape a character inside a + Perl extended class. However, empty \Q\E sequences are allowed, so here + were're only giving an error if the \Q..\E is non-empty. */ + + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR116; + goto FAILED; + } + goto CLASS_LITERAL; } - /* Skip over space and tab (only) in extended-more mode. */ + /* Skip over space and tab (only) in extended-more mode, or anywhere + inside a Perl extended class (which implies /xx). */ - if ((options & PCRE2_EXTENDED_MORE) != 0 && - (c == CHAR_SPACE || c == CHAR_HT)) + if ((c == CHAR_SPACE || c == CHAR_HT) && + ((options & PCRE2_EXTENDED_MORE) != 0 || + class_mode_state >= CLASS_MODE_PERL_EXT)) goto CLASS_CONTINUE; /* Handle POSIX class names. Perl allows a negation extension of the @@ -3722,6 +3757,16 @@ while (ptr < ptrend) goto FAILED; } + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + ptr = tempptr + 2; + errorcode = ERR113; + goto FAILED; + } + if (*ptr != CHAR_COLON) { ptr = tempptr + 2; @@ -3793,15 +3838,39 @@ while (ptr < ptrend) /* Check for the start of the outermost class, or the start of a nested class. */ - else if (c == CHAR_LEFT_SQUARE_BRACKET && - (class_depth_m1 < 0 || (options & PCRE2_ALT_EXTENDED_CLASS) != 0)) + else if ((c == CHAR_LEFT_SQUARE_BRACKET && + (class_depth_m1 < 0 || class_mode_state == CLASS_MODE_ALT_EXT || + class_mode_state == CLASS_MODE_PERL_EXT)) || + (c == CHAR_LEFT_PARENTHESIS && + class_mode_state == CLASS_MODE_PERL_EXT)) { + uint32_t start_c = c; + uint32_t new_class_mode_state; + + /* Update the class mode, if moving into a 'leaf' inside a Perl extended + class. */ + + if (start_c == CHAR_LEFT_SQUARE_BRACKET && + class_mode_state == CLASS_MODE_PERL_EXT && class_depth_m1 >= 0) + new_class_mode_state = CLASS_MODE_PERL_EXT_LEAF; + else + new_class_mode_state = class_mode_state; + /* Tidy up the other class before starting the nested class. */ /* -[ beginning a nested class is a literal '-' */ if (class_range_state == RANGE_STARTED) parsed_pattern[-1] = CHAR_MINUS; + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + /* Validate nesting depth */ if (class_depth_m1 >= ECLASS_NEST_LIMIT - 1) { @@ -3819,12 +3888,16 @@ while (ptr < ptrend) { if (ptr >= ptrend) { - errorcode = ERR6; /* Missing terminating ']' */ + if (start_c == CHAR_LEFT_PARENTHESIS) + errorcode = ERR14; /* Missing terminating ')' */ + else + errorcode = ERR6; /* Missing terminating ']' */ goto FAILED; } GETCHARINCTEST(c, ptr); - if (c == CHAR_BACKSLASH) + if (new_class_mode_state == CLASS_MODE_PERL_EXT) break; + else if (c == CHAR_BACKSLASH) { if (ptr < ptrend && *ptr == CHAR_E) ptr++; else if (ptrend - ptr >= 3 && @@ -3833,8 +3906,9 @@ while (ptr < ptrend) else break; } - else if ((options & PCRE2_EXTENDED_MORE) != 0 && - (c == CHAR_SPACE || c == CHAR_HT)) /* Note: just these two */ + else if ((c == CHAR_SPACE || c == CHAR_HT) && /* Note: just these two */ + ((options & PCRE2_EXTENDED_MORE) != 0 || + new_class_mode_state >= CLASS_MODE_PERL_EXT)) continue; else if (!negate_class && c == CHAR_CIRCUMFLEX_ACCENT) negate_class = TRUE; @@ -3842,11 +3916,15 @@ while (ptr < ptrend) } /* Now the real contents of the class; c has the first "real" character. - Empty classes are permitted only if the option is set. */ + Empty classes are permitted only if the option is set, and if it's not + a Perl-extended class. */ if (c == CHAR_RIGHT_SQUARE_BRACKET && - (cb->external_options & PCRE2_ALLOW_EMPTY_CLASS) != 0) + (cb->external_options & PCRE2_ALLOW_EMPTY_CLASS) != 0 && + new_class_mode_state < CLASS_MODE_PERL_EXT) { + PCRE2_ASSERT(start_c == CHAR_LEFT_SQUARE_BRACKET); + if (class_start != NULL) { PCRE2_ASSERT(class_depth_m1 >= 0); @@ -3879,7 +3957,8 @@ while (ptr < ptrend) class_start = parsed_pattern; *parsed_pattern++ = negate_class? META_CLASS_NOT : META_CLASS; class_range_state = RANGE_NO; - class_op_state = CLASS_OP_NONE; + class_op_state = CLASS_OP_EMPTY; + class_mode_state = new_class_mode_state; ++class_depth_m1; if (class_maxdepth_m1 < class_depth_m1) class_maxdepth_m1 = class_depth_m1; @@ -3887,7 +3966,8 @@ while (ptr < ptrend) cb->class_op_used[class_depth_m1] = 0; /* Implement the special start-of-class literal meaning of ']'. */ - if (c == CHAR_RIGHT_SQUARE_BRACKET) + if (c == CHAR_RIGHT_SQUARE_BRACKET && + new_class_mode_state != CLASS_MODE_PERL_EXT) { class_range_state = RANGE_OK_LITERAL; class_op_state = CLASS_OP_OPERAND; @@ -3900,8 +3980,25 @@ while (ptr < ptrend) /* Check for the end of the class. */ - else if (c == CHAR_RIGHT_SQUARE_BRACKET) + else if (c == CHAR_RIGHT_SQUARE_BRACKET || + (c == CHAR_RIGHT_PARENTHESIS && class_mode_state == CLASS_MODE_PERL_EXT)) { + /* In Perl extended mode, the ']' can only be used to match the + opening '[', and ')' must match an opening parenthesis. */ + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + if (c == CHAR_RIGHT_SQUARE_BRACKET && class_depth_m1 != 0) + { + errorcode = ERR14; + goto FAILED_BACK; + } + if (c == CHAR_RIGHT_PARENTHESIS && class_depth_m1 < 1) + { + errorcode = ERR22; + goto FAILED; + } + } + /* Check no trailing operator. */ if (class_op_state == CLASS_OP_OPERATOR) { @@ -3909,25 +4006,113 @@ while (ptr < ptrend) goto FAILED; } + /* Check no empty expression for Perl extended expressions. */ + if (class_mode_state == CLASS_MODE_PERL_EXT && + class_op_state == CLASS_OP_EMPTY) + { + errorcode = ERR114; + goto FAILED; + } + /* -] at the end of a class is a literal '-' */ if (class_range_state == RANGE_STARTED) parsed_pattern[-1] = CHAR_MINUS; *parsed_pattern++ = META_CLASS_END; - if (--class_depth_m1 < 0) break; + if (--class_depth_m1 < 0) + { + /* Check for and consume ')' after '(?[...]'. */ + PCRE2_ASSERT(class_mode_state != CLASS_MODE_PERL_EXT_LEAF); + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + if (ptr >= ptrend || *ptr != CHAR_RIGHT_PARENTHESIS) + { + errorcode = ERR115; + goto FAILED; + } + + ptr++; + } + + break; + } class_range_state = RANGE_NO; /* for processing the containing class */ class_op_state = CLASS_OP_OPERAND; + if (class_mode_state == CLASS_MODE_PERL_EXT_LEAF) + class_mode_state = CLASS_MODE_PERL_EXT; /* The extended class flag has already been set for the parent class. */ class_start = NULL; } - /* Handle a set operator */ + /* Handle a Perl set binary operator */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT && + (c == CHAR_PLUS || c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || + c == CHAR_AMPERSAND || c == CHAR_CIRCUMFLEX_ACCENT)) + { + /* Check for a preceding operand. */ + if (class_op_state != CLASS_OP_OPERAND) + { + errorcode = ERR109; + goto FAILED; + } + + if (class_start != NULL) + { + PCRE2_ASSERT(class_depth_m1 >= 0); + /* Represents that the class is an extended class. */ + *class_start |= CLASS_IS_ECLASS; + class_start = NULL; + } + + PCRE2_ASSERT(class_range_state != RANGE_STARTED && + class_range_state != RANGE_FORBID_STARTED); + + *parsed_pattern++ = c == CHAR_PLUS? META_ECLASS_OR : + c == CHAR_VERTICAL_LINE? META_ECLASS_OR : + c == CHAR_MINUS? META_ECLASS_SUB : + c == CHAR_AMPERSAND? META_ECLASS_AND : + META_ECLASS_XOR; + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERATOR; + } + + /* Handle a Perl set unary operator */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT && + c == CHAR_EXCLAMATION_MARK) + { + /* Check for no preceding operand. */ + if (class_op_state == CLASS_OP_OPERAND) + { + errorcode = ERR113; + goto FAILED; + } + + if (class_start != NULL) + { + PCRE2_ASSERT(class_depth_m1 >= 0); + /* Represents that the class is an extended class. */ + *class_start |= CLASS_IS_ECLASS; + class_start = NULL; + } + + PCRE2_ASSERT(class_range_state != RANGE_STARTED && + class_range_state != RANGE_FORBID_STARTED); + + *parsed_pattern++ = META_ECLASS_NOT; + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERATOR; + } + + /* Handle a UTS#18 set operator */ - else if ((options & PCRE2_ALT_EXTENDED_CLASS) != 0 && - (c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || c == CHAR_AMPERSAND) && + else if (class_mode_state == CLASS_MODE_ALT_EXT && + (c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || + c == CHAR_AMPERSAND || c == CHAR_TILDE) && ptr < ptrend && *ptr == c) { ++ptr; @@ -3969,71 +4154,16 @@ while (ptr < ptrend) *parsed_pattern++ = c == CHAR_VERTICAL_LINE? META_ECLASS_OR : c == CHAR_MINUS? META_ECLASS_SUB : - META_ECLASS_AND; + c == CHAR_AMPERSAND? META_ECLASS_AND : + META_ECLASS_XOR; class_range_state = RANGE_NO; class_op_state = CLASS_OP_OPERATOR; cb->class_op_used[class_depth_m1] = (uint8_t)c; } - /* Handle potential start of range */ - - else if (c == CHAR_MINUS && class_range_state >= RANGE_OK_ESCAPED) - { - *parsed_pattern++ = (class_range_state == RANGE_OK_LITERAL)? - META_RANGE_LITERAL : META_RANGE_ESCAPED; - class_range_state = RANGE_STARTED; - } - - /* Handle forbidden start of range */ - - else if (c == CHAR_MINUS && class_range_state == RANGE_FORBID_NO) - { - *parsed_pattern++ = CHAR_MINUS; - class_range_state = RANGE_FORBID_STARTED; - class_range_forbid_ptr = ptr; - } - - /* Handle a literal character */ - - else if (c != CHAR_BACKSLASH) - { - CLASS_LITERAL: - if (class_range_state == RANGE_STARTED) - { - if (c == parsed_pattern[-2]) /* Optimize one-char range */ - parsed_pattern--; - else if (parsed_pattern[-2] > c) /* Check range is in order */ - { - errorcode = ERR8; - goto FAILED_BACK; - } - else - { - if (!char_is_literal && parsed_pattern[-1] == META_RANGE_LITERAL) - parsed_pattern[-1] = META_RANGE_ESCAPED; - PARSED_LITERAL(c, parsed_pattern); - } - class_range_state = RANGE_NO; - class_op_state = CLASS_OP_OPERAND; - } - else if (class_range_state == RANGE_FORBID_STARTED) - { - ptr = class_range_forbid_ptr; - errorcode = ERR50; - goto FAILED; - } - else /* Potential start of range */ - { - class_range_state = char_is_literal? - RANGE_OK_LITERAL : RANGE_OK_ESCAPED; - class_op_state = CLASS_OP_OPERAND; - PARSED_LITERAL(c, parsed_pattern); - } - } - /* Handle escapes in a class */ - else + else if (c == CHAR_BACKSLASH) { tempptr = ptr; escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options, @@ -4041,7 +4171,8 @@ while (ptr < ptrend) if (errorcode != 0) { - if ((xoptions & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0) + if ((xoptions & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0 || + class_mode_state >= CLASS_MODE_PERL_EXT) goto FAILED; ptr = tempptr; if (ptr >= ptrend) c = CHAR_BACKSLASH; else @@ -4151,20 +4282,106 @@ while (ptr < ptrend) goto FAILED; } + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + class_range_state = RANGE_FORBID_NO; class_op_state = CLASS_OP_OPERAND; } + /* Forbid unescaped literals, and the special meaning of '-', inside a + Perl extended class. */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR116; + goto FAILED; + } + + /* Handle potential start of range */ + + else if (c == CHAR_MINUS && class_range_state >= RANGE_OK_ESCAPED) + { + *parsed_pattern++ = (class_range_state == RANGE_OK_LITERAL)? + META_RANGE_LITERAL : META_RANGE_ESCAPED; + class_range_state = RANGE_STARTED; + } + + /* Handle forbidden start of range */ + + else if (c == CHAR_MINUS && class_range_state == RANGE_FORBID_NO) + { + *parsed_pattern++ = CHAR_MINUS; + class_range_state = RANGE_FORBID_STARTED; + class_range_forbid_ptr = ptr; + } + + /* Handle a literal character */ + + else + { + CLASS_LITERAL: + + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + + if (class_range_state == RANGE_STARTED) + { + if (c == parsed_pattern[-2]) /* Optimize one-char range */ + parsed_pattern--; + else if (parsed_pattern[-2] > c) /* Check range is in order */ + { + errorcode = ERR8; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 + } + else + { + if (!char_is_literal && parsed_pattern[-1] == META_RANGE_LITERAL) + parsed_pattern[-1] = META_RANGE_ESCAPED; + PARSED_LITERAL(c, parsed_pattern); + } + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERAND; + } + else if (class_range_state == RANGE_FORBID_STARTED) + { + ptr = class_range_forbid_ptr; + errorcode = ERR50; + goto FAILED; + } + else /* Potential start of range */ + { + class_range_state = char_is_literal? + RANGE_OK_LITERAL : RANGE_OK_ESCAPED; + class_op_state = CLASS_OP_OPERAND; + PARSED_LITERAL(c, parsed_pattern); + } + } + /* Proceed to next thing in the class. */ CLASS_CONTINUE: if (ptr >= ptrend) { - if ((options & PCRE2_ALT_EXTENDED_CLASS) != 0 && + if (class_mode_state == CLASS_MODE_PERL_EXT && class_depth_m1 > 0) + errorcode = ERR14; /* Missing terminating ')' */ + if (class_mode_state == CLASS_MODE_ALT_EXT && class_depth_m1 == 0 && class_maxdepth_m1 == 1) errorcode = ERR112; /* Missing terminating ']', but we saw '[ [ ]...' */ else - errorcode = ERR6; /* Missing terminating ']' */ + errorcode = ERR6; /* Missing terminating ']' */ goto FAILED; } GETCHARINCTEST(c, ptr); @@ -4296,6 +4513,7 @@ while (ptr < ptrend) if (read_number(&ptr, ptrend, cb->bracount, MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) { + PCRE2_ASSERT(i >= 0); if (i <= 0) { errorcode = ERR15; @@ -4743,11 +4961,7 @@ while (ptr < ptrend) (IS_DIGIT(*ptr))? -1:(int)(cb->bracount), /* + and - are relative */ MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) goto FAILED; - if (i < 0) /* NB (?0) is permitted */ - { - errorcode = ERR15; /* Unknown group */ - goto FAILED_BACK; - } + PCRE2_ASSERT(i >= 0); /* NB (?0) is permitted, represented by i=0 */ if (ptr >= ptrend || *ptr != CHAR_RIGHT_PARENTHESIS) goto UNCLOSED_PARENTHESIS; @@ -4939,6 +5153,7 @@ while (ptr < ptrend) if (read_number(&ptr, ptrend, cb->bracount, MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) { + PCRE2_ASSERT(i >= 0); if (i <= 0) { errorcode = ERR15; @@ -5089,7 +5304,7 @@ while (ptr < ptrend) goto POST_ASSERTION; case CHAR_ASTERISK: - POSITIVE_NONATOMIC_LOOK_AHEAD: /* Come from (?* */ + POSITIVE_NONATOMIC_LOOK_AHEAD: /* Come from (*napla: */ *parsed_pattern++ = META_LOOKAHEAD_NA; ptr++; goto POST_ASSERTION; @@ -5254,6 +5469,18 @@ while (ptr < ptrend) cb->named_groups[cb->names_found].isdup = (uint16_t)isdupname; cb->names_found++; break; + + + /* ---- Perl extended character class ---- */ + + /* These are of the form '(?[...])'. We handle these via the same parser + that consumes ordinary '[...]' classes, but with a flag set to activate + the extended behaviour. */ + + case CHAR_LEFT_SQUARE_BRACKET: + class_mode_state = CLASS_MODE_PERL_EXT; + c = *ptr++; + goto FROM_PERL_EXTENDED_CLASS; } /* End of (? switch */ break; /* End of ( handling */ @@ -5300,7 +5527,7 @@ while (ptr < ptrend) if (nest_depth == 0) /* Unmatched closing parenthesis */ { errorcode = ERR22; - goto FAILED_BACK; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 } nest_depth--; *parsed_pattern++ = META_KET; diff --git a/src/pcre2_compile.h b/src/pcre2_compile.h index e94ace19c..57a72afe2 100644 --- a/src/pcre2_compile.h +++ b/src/pcre2_compile.h @@ -62,7 +62,7 @@ enum { ERR0 = COMPILE_ERROR_BASE, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98, ERR99, ERR100, ERR101,ERR102,ERR103,ERR104,ERR105,ERR106,ERR107,ERR108,ERR109,ERR110, - ERR111,ERR112 }; + ERR111,ERR112,ERR113,ERR114,ERR115,ERR116 }; /* Code values for parsed patterns, which are stored in a vector of 32-bit unsigned ints. Values less than META_END are literal data values. The coding @@ -158,11 +158,14 @@ versions. */ #define META_MINMAX_PLUS 0x80420000u /* {n,m}+ repeat */ #define META_MINMAX_QUERY 0x80430000u /* {n,m}? repeat */ -/* These meta codes have no ordering constraints. */ +/* These meta codes must be kept in a group, with the OR/SUB/XOR in +this order. */ -#define META_ECLASS_OR 0x80440000u /* || in a class */ -#define META_ECLASS_AND 0x80450000u /* && in a class */ -#define META_ECLASS_SUB 0x80460000u /* -- in a class */ +#define META_ECLASS_AND 0x80440000u /* && (or &) in a class */ +#define META_ECLASS_OR 0x80450000u /* || (or |, +) in a class */ +#define META_ECLASS_SUB 0x80460000u /* -- (or -) in a class */ +#define META_ECLASS_XOR 0x80470000u /* ~~ (or ^) in a class */ +#define META_ECLASS_NOT 0x80480000u /* ! in a class */ /* Convenience aliases. */ diff --git a/src/pcre2_compile_class.c b/src/pcre2_compile_class.c index 2896e3ca7..ca44983be 100644 --- a/src/pcre2_compile_class.c +++ b/src/pcre2_compile_class.c @@ -61,9 +61,11 @@ b) none of the cases here: case META_CLASS_EMPTY: \ case META_CLASS_EMPTY_NOT: \ case META_CLASS_END: \ - case META_ECLASS_OR: \ case META_ECLASS_AND: \ - case META_ECLASS_SUB: + case META_ECLASS_OR: \ + case META_ECLASS_SUB: \ + case META_ECLASS_XOR: \ + case META_ECLASS_NOT: #else #define CLASS_END_CASES(meta) \ default: @@ -1798,9 +1800,11 @@ while (TRUE) switch (META_CODE(*ptr)) { case META_CLASS_END: - case META_ECLASS_OR: case META_ECLASS_AND: + case META_ECLASS_OR: case META_ECLASS_SUB: + case META_ECLASS_XOR: + case META_ECLASS_NOT: goto DONE; case META_CLASS_EMPTY_NOT: @@ -1865,6 +1869,139 @@ return TRUE; +/* This function consumes unary prefix operators. */ + +static BOOL +compile_class_unary(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +BOOL negated = FALSE; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +while (*ptr == META_ECLASS_NOT) + { + ++ptr; + negated = !negated; + } + +/* Because it's a non-empty class, there must be an operand. */ +if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + +/* Convert prefix to postfix (RPN). */ +if (negated) + { + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = OP_ECLASS_NOT; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + +/* This function consumes tightly-binding binary operators. */ + +static BOOL +compile_class_binary_tight(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +/* Because it's a non-empty class, there must be an operand at the start. */ +if (!compile_class_unary(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + +while (*ptr == META_ECLASS_AND) + { + uint32_t op = OP_ECLASS_AND; + ++ptr; + + /* An operand must follow the operator. */ + if (!compile_class_unary(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + + /* Convert infix to postfix (RPN). */ + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = op; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + +/* This function consumes loosely-binding binary operators. */ + +static BOOL +compile_class_binary_loose(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +/* Because it's a non-empty class, there must be an operand at the start. */ +if (!compile_class_binary_tight(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) + return FALSE; + +while (*ptr >= META_ECLASS_OR && *ptr <= META_ECLASS_XOR) + { + uint32_t op = *ptr == META_ECLASS_OR ? OP_ECLASS_OR : + *ptr == META_ECLASS_SUB ? OP_ECLASS_SUB : + OP_ECLASS_XOR; + ++ptr; + + /* An operand must follow the operator. */ + if (!compile_class_binary_tight(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) + return FALSE; + + /* Convert infix to postfix (RPN). */ + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = op; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + /* This function converts the META codes in pptr into opcodes written to pcode. The pptr must start at a META_CLASS or META_CLASS_NOT. @@ -1898,29 +2035,10 @@ PCRE2_ASSERT(*ptr == (META_CLASS | CLASS_IS_ECLASS) || negated = *ptr++ == (META_CLASS_NOT | CLASS_IS_ECLASS); /* Because it's a non-empty class, there must be an operand at the start. */ -if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, - lengthptr)) +if (!compile_class_binary_loose(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) return FALSE; -while (*ptr >= META_ECLASS_OR && *ptr <= META_ECLASS_SUB) - { - uint32_t op = *ptr == META_ECLASS_OR ? OP_ECLASS_OR : - *ptr == META_ECLASS_AND ? OP_ECLASS_AND : - OP_ECLASS_SUB; - ++ptr; - - /* An operand must follow the operator. */ - if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, - lengthptr)) - return FALSE; - - /* Convert infix to postfix (RPN). */ - if (lengthptr != NULL) - (*lengthptr)++; - else - *code++ = op; - } - if (negated) { if (lengthptr != NULL) diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index a674bb7a8..aa1f58cc6 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -191,7 +191,7 @@ static const uint8_t coptable[] = { 0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */ 0, 0, /* \B and \b in UCP mode */ 0, /* ECLASS */ - 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ + 0, 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ }; /* This table identifies those opcodes that inspect a character. It is used to @@ -272,7 +272,7 @@ static const uint8_t poptable[] = { 0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */ 1, 1, /* \B and \b in UCP mode */ 1, /* ECLASS */ - 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ + 0, 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ }; /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W, diff --git a/src/pcre2_error.c b/src/pcre2_error.c index 771b40ac9..5e1c3dbef 100644 --- a/src/pcre2_error.c +++ b/src/pcre2_error.c @@ -196,13 +196,18 @@ static const unsigned char compile_error_texts[] = /* 105 */ "PCRE2_EXTRA_TURKISH_CASING requires UTF in 8-bit mode\0" "PCRE2_EXTRA_TURKISH_CASING and PCRE2_EXTRA_CASELESS_RESTRICT are not compatible\0" - "character classes are too deeply nested\0" + "character class nesting is too deep\0" "invalid operator in character class\0" "unexpected operator in character class (no preceding operand)\0" /* 110 */ "expected operand after operator in character class\0" "brackets needed to clarify operator precedence in character class\0" "missing terminating ] for character class (note '[' must be escaped under PCRE2_ALT_EXTENDED_CLASS)\0" + "unexpected expression in character class (no preceding operator)\0" + "empty expression in character class\0" + /* 115 */ + "unexpected ] with no following parenthesis in (?[...\0" + "unexpected character in (?[...]) character class\0" ; /* Match-time and UTF error texts are in the same format. */ diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index 55d950425..379d23c95 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -577,7 +577,10 @@ modes. */ /* The maximum nesting depth for Unicode character class sets. Currently fixed. Warning: the interpreter relies on this so it can encode the operand stack in a uint32_t. A nesting limit of 15 implies (15*2+1)=31 -stack operands required. */ +stack operands required, due to the fact that we have two (and only two) +levels of operator precedence. In the UTS#18 syntax, you can write 'x&&y[z]' +and in Perl syntax you can write '(?[ x - y & (z) ])', both of which imply +pushing the match results for x & y to the stack. */ #define ECLASS_NEST_LIMIT 15 @@ -1749,10 +1752,11 @@ enum { /* These are used for "extended classes" such as [a-z -- aeiou]. */ OP_ECLASS, /* 172 */ - OP_ECLASS_OR, /* 173 */ - OP_ECLASS_AND, /* 174 */ + OP_ECLASS_AND, /* 173 */ + OP_ECLASS_OR, /* 174 */ OP_ECLASS_SUB, /* 175 */ - OP_ECLASS_NOT, /* 176 */ + OP_ECLASS_XOR, /* 176 */ + OP_ECLASS_NOT, /* 177 */ /* This is not an opcode, but is used to check that tables indexed by opcode are the correct length, in order to catch updating errors - there have been @@ -1815,7 +1819,7 @@ some cases doesn't actually use these names at all). */ "*THEN", "*THEN", "*COMMIT", "*COMMIT", "*FAIL", \ "*ACCEPT", "*ASSERT_ACCEPT", \ "Close", "Skip zero", "Define", "\\B (ucp)", "\\b (ucp)", \ - "eclass", "||", "&&", "--", "!!" + "eclass", "&&", "||", "--", "~~", "!!" /* This macro defines the length of fixed length operations in the compiled @@ -1916,7 +1920,7 @@ in UTF-8 mode. The code that uses this table must know about such things. */ 1, /* DEFINE */ \ 1, 1, /* \B and \b in UCP mode */ \ 0, /* ECLASS - variable length */ \ - 1, 1, 1, 1 /* ECLASS ops, nested inside ECLASS */ + 1, 1, 1, 1, 1 /* ECLASS ops, nested inside ECLASS */ /* A magic value for OP_RREF to indicate the "any recursion" condition. */ diff --git a/src/pcre2_printint.c b/src/pcre2_printint.c index a1df310c7..54dd32de7 100644 --- a/src/pcre2_printint.c +++ b/src/pcre2_printint.c @@ -966,10 +966,6 @@ for(;;) fprintf(f, " op: &&\n"); ccode += OP_lengths[*ccode]; break; - case OP_ECLASS_NOT: - fprintf(f, " op: ^\n"); - ccode += OP_lengths[*ccode]; - break; case OP_ECLASS_OR: fprintf(f, " op: ||\n"); ccode += OP_lengths[*ccode]; @@ -978,6 +974,14 @@ for(;;) fprintf(f, " op: --\n"); ccode += OP_lengths[*ccode]; break; + case OP_ECLASS_XOR: + fprintf(f, " op: ~~\n"); + ccode += OP_lengths[*ccode]; + break; + case OP_ECLASS_NOT: + fprintf(f, " op: ^\n"); + ccode += OP_lengths[*ccode]; + break; /* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537 Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */ diff --git a/src/pcre2_xclass.c b/src/pcre2_xclass.c index 1d16bddea..6b2b110b2 100644 --- a/src/pcre2_xclass.c +++ b/src/pcre2_xclass.c @@ -474,14 +474,14 @@ while (ptr < data_end) { switch (*ptr) { - case OP_ECLASS_OR: + case OP_ECLASS_AND: ++ptr; - stack = (stack >> 1) | (stack & (uint32_t)1u); + stack = (stack >> 1) & (stack | ~(uint32_t)1u); break; - case OP_ECLASS_AND: + case OP_ECLASS_OR: ++ptr; - stack = (stack >> 1) & (stack | ~(uint32_t)1u); + stack = (stack >> 1) | (stack & (uint32_t)1u); break; case OP_ECLASS_SUB: @@ -489,6 +489,11 @@ while (ptr < data_end) stack = (stack >> 1) & (~stack | ~(uint32_t)1u); break; + case OP_ECLASS_XOR: + ++ptr; + stack = (stack >> 1) ^ (stack & (uint32_t)1u); + break; + case OP_ECLASS_NOT: ++ptr; stack ^= (uint32_t)1u; diff --git a/testdata/testinput1 b/testdata/testinput1 index 1e50369fc..10ee898cd 100644 --- a/testdata/testinput1 +++ b/testdata/testinput1 @@ -6729,4 +6729,311 @@ $/x z \ \ +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\n])/ + \n +\= Expect no match + \\ + n + +/^(?[\x61])b/ + ab +\= Expect no match + b + a + +/^(?[\x61])+b/ + ab + aab +\= Expect no match + b + +/(?[ [[:graph:]] ])/ + a +\= Expect no match + \x01 + +/(?[ [:graph:] ])/ + a +\= Expect no match + \x01 + +/(?[ [[:graph:]\x02] ])/ + a + \x02 +\= Expect no match + \x01 + +/(?[\E\n])/ + \n +\= Expect no match + \\ + E + +/(?[\n \Q\E])/ + \n +\= Expect no match + \\ + Q + +/(?[ ( \x02 + [:graph:] ) | [ \x02 [:graph:] ] ])/ + a + \x02 +\= Expect no match + \x01 + +/(?[ \d ])/ + 1 +\= Expect no match + d + +/(?[[1]])/ + 1 +\= Expect no match + ] + +/(?[[a]])/ + a +\= Expect no match + ] + +/(?[[a-c]])/ + a + b +\= Expect no match + - + ] + +/(?[ [\t] + [\n] ])/ + \t + \n +\= Expect no match + t + \\ + [ + +/(?[ \t + \n ])/ + \t + \n +\= Expect no match + t + \\ + [ + +/(?[ [()] ])/ + ) + ( +\= Expect no match + ] + +/(?[ ( [()] ) ])/ + ) + ( +\= Expect no match + ] + +/(?[ (( [\n\t] )) ])/ + \n + \t +\= Expect no match + ) + ( + t + +# Each syntax element, with unary operator applied to it + +/(?[ !\n ])/ + z +\= Expect no match + \n + +/(?[ !\d ])/ + a +\= Expect no match + 1 + +/(?[ ![:alpha:] ])/ + 1 +\= Expect no match + a + +/(?[ ![\n] ])/ + z +\= Expect no match + \n + +/(?[ !(\n) ])/ + z +\= Expect no match + \n + +/(?[ !!\n ])/ + \n +\= Expect no match + z + +# Each syntax element, as contents of parens + +/(?[ (\n) ])/ + \n +\= Expect no match + z + +/(?[ (\d) ])/ + 1 +\= Expect no match + a + +/(?[ ([:alpha:]) ])/ + a +\= Expect no match + 1 + +/(?[ ([\n]) ])/ + \n +\= Expect no match + z + +/(?[ ((\n)) ])/ + \n +\= Expect no match + z + +/(?[ (!\n) ])/ + z +\= Expect no match + \n + +/(?[ (\n + \t) ])/ + \n + \t +\= Expect no match + z + +# Each syntax element, as LHS of a binary operator + +/(?[ \n & [\n\t] ])/ + \n +\= Expect no match + t + +/(?[ \d & [\d\t] ])/ + 1 +\= Expect no match + a + +/(?[ [:alpha:] & [a-z\t] ])/ + a +\= Expect no match + A + \t + +/(?[ [\n] & [\n\t] ])/ + \n +\= Expect no match + \t + +/(?[ (\n) & [\n\t] ])/ + \n +\= Expect no match + \t + +/(?[ !\n & [^\n\t] ])/ + a +\= Expect no match + \n + \t + +/(?[ \n & [\n\t] + [\d] ])/ + \n + 1 +\= Expect no match + \t + a + +# Each syntax element, as RHS of a binary operator + +/(?[ [\n\t] & \n ])/ + \n +\= Expect no match + t + +/(?[ [\d\t] & \d ])/ + 1 +\= Expect no match + a + +/(?[ [a-z\t] & [:alpha:] ])/ + a +\= Expect no match + A + \t + +/(?[ [\n\t] & [\n] ])/ + \n +\= Expect no match + \t + +/(?[ [\n\t] & (\n) ])/ + \n +\= Expect no match + \t + +/(?[ [^\n\t] & !\n ])/ + a +\= Expect no match + \n + \t + +/(?[ [\d] + \n & [\n\t] ])/ + \n + 1 +\= Expect no match + \t + a + +/(?[ [\d] + \n + [\t] ])/ + \n + \t + 1 +\= Expect no match + a + +# end op surrounding syntax tests + +/(?[ \d + \n ])/ + \n + 1 +\= Expect no match + a + +/(?[ \d | \n ])/ + \n + 1 +\= Expect no match + a + +/(?[ \d - [2] ])/ + 1 + 3 +\= Expect no match + 2 + +/(?[ [AC] ^ [BC] ])/ + A + B +\= Expect no match + C + D + +/(?[ ( [ ^ z ] ) ])/ + j +\= Expect no match + z + +# -------------- + # End of testinput1 diff --git a/testdata/testinput11 b/testdata/testinput11 index 26d976e9e..c7d237a2f 100644 --- a/testdata/testinput11 +++ b/testdata/testinput11 @@ -476,4 +476,29 @@ # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B + \x{80000002} + \x{8fffffff} +\= Expect no match + \x{7fffffff} + \x{90000000} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +# -------------- + # End of testinput11 diff --git a/testdata/testinput2 b/testdata/testinput2 index 1fbb778e0..e4666bf9d 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -3975,6 +3975,12 @@ no reduction Alfred Winifred +/[[:<:]]+red/B + little red riding hood + red is a colour +\= Expect no match + Alfred + /[a[:<:]] should give error/ /(?=ab\K)/aftertext,allow_lookaround_bsk @@ -6707,6 +6713,34 @@ a)"xI B C +/[[AC]||[BC]]/B,alt_extended_class + A + B + C +\= Expect no match + D + +/[[AC]&&[BC]]/B,alt_extended_class + C +\= Expect no match + A + B + D + +/[[AC]--[BC]]/B,alt_extended_class + A +\= Expect no match + B + C + D + +/[[AC]~~[BC]]/B,alt_extended_class + A + B +\= Expect no match + C + D + /[A[]]]/B,alt_extended_class A ] @@ -6970,8 +7004,24 @@ a)"xI /[a|||b]/alt_extended_class +/[a~~~b]/alt_extended_class + +/[a~~~~b]/alt_extended_class + +/[a~~/alt_extended_class + +/[a~~~/alt_extended_class + +/[a~~~~/alt_extended_class + /[a||b&&c]/alt_extended_class +/[a||b~~c]/alt_extended_class + +/[a~~b&&c]/alt_extended_class + +/[a--b~~c]/alt_extended_class + /[a--b&&c]/alt_extended_class /[a||b--c]/alt_extended_class @@ -6998,6 +7048,120 @@ a)"xI /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a&&a[z]]]]]]]]]]]]]]]]/alt_extended_class +/[z&/alt_extended_class + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +# allow-empty-class does nothing inside (?[...]) +/(?[ []] ])/B,allow_empty_class + ] + +# bad-escape-is-literal does nothing inside (?[...]) +/[ \j ]/ + +/[ /\ + +/(?[ \j ])/ + +/(?[ /\ + +/[ \j ]/bad_escape_is_literal + j +\= Expect no match + k + +/[ /\bad_escape_is_literal + +/(?[ \j ])/bad_escape_is_literal + +/(?[ /\bad_escape_is_literal + +/(?[ [\j] ])/bad_escape_is_literal + +/(?[ (\j) ])/bad_escape_is_literal + +# We can't test error cases in testinput1 + +/(?[])/ + +/(?[/ + +/(?[]/ + +/(?[\n/ + +/(?[\n]/ + +/(?[\n]z)/ + +/(?[\n] )/ + +/(?[(/ + +/(?[( / + +/(?[(\n/ + +/(?[ \n + () ])/ + +/(?[1])/ + +/(?[a])/ + +/(?[a-c])/ + +/(?[(])/ + +/(?[(\n])/ + +/(?[\n)])/ + +/(?[^\n])/ + +/(?[ \n \t ])/ + +/(?[ \d \t ])/ + +/(?[ [\n] \t ])/ + +/(?[ (\n) \t ])/ + +/(?[ [:alpha:] \t ])/ + +/(?[ \n + \t \d ])/ + +/(?[ !\n \t ])/ + +/(?[ \n [:alpha:] ])/ + +/(?[ \n [\d] ])/ + +/(?[ \n (\t) ])/ + +/(?[ \n !\t ])/ + +/(?[ \n \t ])/ + +/(?[:graph:])/ + +/(?[\Qn\E])/ + +# maximum depth tests + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&\n))))))))))))))])/ + \n +\= Expect no match + a + b + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+([\n]&\n))))))))))))))])/ + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&[\n]))))))))))))))])/ + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+((\n)&\n))))))))))))))])/ + # -------------- /[[:digit:] -Z]/xx @@ -7016,4 +7180,41 @@ a)"xI /[\d-z]/ +/[\d-\w]/ + +/[\Q/ + +/[\Q/\ + +/[\Q\E/ + +/[\Q\n/ + +/[\Q\n]/ + +/[\Q\n/\ + +/[\Q\n\]/ + +/[\Q\n\E/ + +/[\Q\n\E]/ + \\ + n +\= Expect no match + \n + Q + +/[z\Q/ + +/[z\Q/\ + +/[z\Q\E/ + +/[/\ + +/[\n/ + +/[\E/ + # End of testinput2 diff --git a/testdata/testinput4 b/testdata/testinput4 index c0c17e103..8b3e0c7b8 100644 --- a/testdata/testinput4 +++ b/testdata/testinput4 @@ -2949,4 +2949,96 @@ A \x{3a3} +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\p{L} - \p{Lu}])/ + a +\= Expect no match + A + 1 + +/(?[\p{L} & \p{Lu}])/ + A +\= Expect no match + a + 1 + +/(?[[\p{Lu}z] ^ [\p{Ll}G]])/ + A + p +\= Expect no match + G + z + 1 + +/(?[\p{Ll} | \p{Nd}])/ + a + 1 +\= Expect no match + A + +/(?[\p{Ll} + [\p{Nd}]])/ + a + 1 +\= Expect no match + A + +/(?[ ![\p{Nd}z] ])/ + _ + Z +\= Expect no match + 1 + z + +/(?[ \P{Nd} + [2] ])/ + _ + Z + 2 +\= Expect no match + 1 + 3 + +/(?[ ![\P{Nd}] ])/ + 1 + 2 +\= Expect no match + _ + z + +# caseless tests + +/(?[ \p{Lu} ^ \p{Ll} ])/ + a + A +\= Expect no match + _ + 1 + +/(?[ [\p{Lu}1] ^ \p{Ll} ])/i + 1 +\= Expect no match + a + A + _ + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/ + 1 +\= Expect no match + a + A + _ + 2 + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/i + a + A + 1 +\= Expect no match + _ + 2 + +# -------------- + # End of testinput4 diff --git a/testdata/testinput5 b/testdata/testinput5 index 503ffa654..05e7fe613 100644 --- a/testdata/testinput5 +++ b/testdata/testinput5 @@ -2927,6 +2927,34 @@ A 0 +/[[\p{Lu}\p{Ll}]||[\p{Nd}\p{Ll}]]/B,alt_extended_class + A + 1 + c +\= Expect no match + _ + +/[[\p{Lu}\p{Ll}]&&[\p{Nd}\p{Ll}]]/B,alt_extended_class + c +\= Expect no match + A + 1 + _ + +/[[\p{Lu}\p{Ll}]--[\p{Nd}\p{Ll}]]/B,alt_extended_class + A +\= Expect no match + 1 + c + _ + +/[[\p{Lu}\p{Ll}]~~[\p{Nd}\p{Ll}]]/B,alt_extended_class + A + 1 +\= Expect no match + c + _ + /[\pL[]]]/B,alt_extended_class A ] @@ -3178,12 +3206,59 @@ \= Expect no match n_n -/[\p{Nd}||[\pL--\p{Lu}]]/alt_extended_class +/[\p{Nd}||[\pL--\p{Lu}]]/B,alt_extended_class a 0 \= Expect no match C +/[\P{Nd}||2]/B,alt_extended_class + _ + Z + 2 +\= Expect no match + 1 + 3 + +/[^[\P{Nd}]]/B,alt_extended_class + 1 + 2 +\= Expect no match + _ + z + +# caseless tests + +/[\p{Lu}~~\p{Ll}]/B,alt_extended_class + a + A +\= Expect no match + _ + 1 + +/[[\p{Lu}1]~~\p{Ll}]/iB,alt_extended_class + 1 +\= Expect no match + a + A + _ + +/[[\p{Lu}1]&&[\p{Ll}1]]/B,alt_extended_class + 1 +\= Expect no match + a + A + _ + 2 + +/[[\p{Lu}1]&&[\p{Ll}1]]/iB,alt_extended_class + a + A + 1 +\= Expect no match + _ + 2 + /[[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] diff --git a/testdata/testinput6 b/testdata/testinput6 index ac3315b7c..1fbe4ce8d 100644 --- a/testdata/testinput6 +++ b/testdata/testinput6 @@ -5189,4 +5189,15 @@ # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[A]+[B]])/ + A + B +\= Expect no match + [ + ] + +# -------------- + # End of testinput6 diff --git a/testdata/testinput7 b/testdata/testinput7 index d91ea2854..56cd49e3f 100644 --- a/testdata/testinput7 +++ b/testdata/testinput7 @@ -2682,6 +2682,64 @@ >aBBd< >Abb\x{01c5}< +# -------------- + +# EXTENDED CHARACTER CLASSES + +/[\p{Ll}[\p{Nd}]]C/alt_extended_class + aC + 1C +\= Expect no match + [C + +/[[\p{Ll}][\p{Nd}]]/alt_extended_class + a + 1 +\= Expect no match + [ + ] + +/[[\p{Ll}]||[\p{Nd}]]/alt_extended_class + a + 1 +\= Expect no match + C + +/[[^\p{Ll}][\p{Nd}]]/alt_extended_class + 1 + A +\= Expect no match + a + +/[^[\p{Ll}][\p{Nd}]]/alt_extended_class + A +\= Expect no match + a + 1 + +/[^[\p{Ll}]&&[\p{Nd}]]/alt_extended_class + a + 1 + A + +/(?[[\p{Ll}]+[\p{Nd}]])/ + a + 1 +\= Expect no match + [ + ] + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[\p{Ll}Z]&[\p{Lu}a]])/ + a + Z +\= Expect no match + A + z + # -------------------------------------------------------------------------- # End of testinput7 diff --git a/testdata/testoutput1 b/testdata/testoutput1 index 6f9277292..1e20d9d2c 100644 --- a/testdata/testoutput1 +++ b/testdata/testoutput1 @@ -10611,4 +10611,449 @@ No match \ \ No match +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\n])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + n +No match + +/^(?[\x61])b/ + ab + 0: ab +\= Expect no match + b +No match + a +No match + +/^(?[\x61])+b/ + ab + 0: ab + aab + 0: aab +\= Expect no match + b +No match + +/(?[ [[:graph:]] ])/ + a + 0: a +\= Expect no match + \x01 +No match + +/(?[ [:graph:] ])/ + a + 0: a +\= Expect no match + \x01 +No match + +/(?[ [[:graph:]\x02] ])/ + a + 0: a + \x02 + 0: \x02 +\= Expect no match + \x01 +No match + +/(?[\E\n])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + E +No match + +/(?[\n \Q\E])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + Q +No match + +/(?[ ( \x02 + [:graph:] ) | [ \x02 [:graph:] ] ])/ + a + 0: a + \x02 + 0: \x02 +\= Expect no match + \x01 +No match + +/(?[ \d ])/ + 1 + 0: 1 +\= Expect no match + d +No match + +/(?[[1]])/ + 1 + 0: 1 +\= Expect no match + ] +No match + +/(?[[a]])/ + a + 0: a +\= Expect no match + ] +No match + +/(?[[a-c]])/ + a + 0: a + b + 0: b +\= Expect no match + - +No match + ] +No match + +/(?[ [\t] + [\n] ])/ + \t + 0: \x09 + \n + 0: \x0a +\= Expect no match + t +No match + \\ +No match + [ +No match + +/(?[ \t + \n ])/ + \t + 0: \x09 + \n + 0: \x0a +\= Expect no match + t +No match + \\ +No match + [ +No match + +/(?[ [()] ])/ + ) + 0: ) + ( + 0: ( +\= Expect no match + ] +No match + +/(?[ ( [()] ) ])/ + ) + 0: ) + ( + 0: ( +\= Expect no match + ] +No match + +/(?[ (( [\n\t] )) ])/ + \n + 0: \x0a + \t + 0: \x09 +\= Expect no match + ) +No match + ( +No match + t +No match + +# Each syntax element, with unary operator applied to it + +/(?[ !\n ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !\d ])/ + a + 0: a +\= Expect no match + 1 +No match + +/(?[ ![:alpha:] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ ![\n] ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !(\n) ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !!\n ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +# Each syntax element, as contents of parens + +/(?[ (\n) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ (\d) ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ ([:alpha:]) ])/ + a + 0: a +\= Expect no match + 1 +No match + +/(?[ ([\n]) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ ((\n)) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ (!\n) ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ (\n + \t) ])/ + \n + 0: \x0a + \t + 0: \x09 +\= Expect no match + z +No match + +# Each syntax element, as LHS of a binary operator + +/(?[ \n & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + t +No match + +/(?[ \d & [\d\t] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ [:alpha:] & [a-z\t] ])/ + a + 0: a +\= Expect no match + A +No match + \t +No match + +/(?[ [\n] & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ (\n) & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ !\n & [^\n\t] ])/ + a + 0: a +\= Expect no match + \n +No match + \t +No match + +/(?[ \n & [\n\t] + [\d] ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + \t +No match + a +No match + +# Each syntax element, as RHS of a binary operator + +/(?[ [\n\t] & \n ])/ + \n + 0: \x0a +\= Expect no match + t +No match + +/(?[ [\d\t] & \d ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ [a-z\t] & [:alpha:] ])/ + a + 0: a +\= Expect no match + A +No match + \t +No match + +/(?[ [\n\t] & [\n] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ [\n\t] & (\n) ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ [^\n\t] & !\n ])/ + a + 0: a +\= Expect no match + \n +No match + \t +No match + +/(?[ [\d] + \n & [\n\t] ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + \t +No match + a +No match + +/(?[ [\d] + \n + [\t] ])/ + \n + 0: \x0a + \t + 0: \x09 + 1 + 0: 1 +\= Expect no match + a +No match + +# end op surrounding syntax tests + +/(?[ \d + \n ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ \d | \n ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ \d - [2] ])/ + 1 + 0: 1 + 3 + 0: 3 +\= Expect no match + 2 +No match + +/(?[ [AC] ^ [BC] ])/ + A + 0: A + B + 0: B +\= Expect no match + C +No match + D +No match + +/(?[ ( [ ^ z ] ) ])/ + j + 0: j +\= Expect no match + z +No match + +# -------------- + # End of testinput1 diff --git a/testdata/testoutput11-16 b/testdata/testoutput11-16 index 7101d20c9..6cc49d33f 100644 --- a/testdata/testoutput11-16 +++ b/testdata/testoutput11-16 @@ -822,4 +822,32 @@ Failed: error 134 at offset 13: character code point value in \x{} or \o{} is to # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000002} + \x{8fffffff} +\= Expect no match + \x{7fffffff} + \x{90000000} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +# -------------- + # End of testinput11 diff --git a/testdata/testoutput11-32 b/testdata/testoutput11-32 index 3d0a75332..8fc94506c 100644 --- a/testdata/testoutput11-32 +++ b/testdata/testoutput11-32 @@ -939,4 +939,69 @@ No match # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{8fffffff}] + op: || + ] + Ket + End +------------------------------------------------------------------ + \x{80000002} + 0: \x{80000002} + \x{8fffffff} + 0: \x{8fffffff} +\= Expect no match + \x{7fffffff} +No match + \x{90000000} +No match + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{80000002}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + \x{80000001} + 0: \x{80000001} + \x{80000003} + 0: \x{80000003} +\= Expect no match + \x{80000002} +No match + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{80000002}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + \x{80000001} + 0: \x{80000001} + \x{80000003} + 0: \x{80000003} +\= Expect no match + \x{80000002} +No match + +# -------------- + # End of testinput11 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index 99714596e..3572e3258 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -13476,6 +13476,29 @@ No match Alfred Winifred No match +/[[:<:]]+red/B +------------------------------------------------------------------ + Bra + \b + Assert + \w + Ket + Brazero + Assert + \w + Ket + red + Ket + End +------------------------------------------------------------------ + little red riding hood + 0: red + red is a colour + 0: red +\= Expect no match + Alfred +No match + /[a[:<:]] should give error/ Failed: error 130 at offset 7: unknown POSIX class name @@ -19744,6 +19767,90 @@ No match C 0: C +/[[AC]||[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: || + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + B + 0: B + C + 0: C +\= Expect no match + D +No match + +/[[AC]&&[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: && + ] + Ket + End +------------------------------------------------------------------ + C + 0: C +\= Expect no match + A +No match + B +No match + D +No match + +/[[AC]--[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: -- + ] + Ket + End +------------------------------------------------------------------ + A + 0: A +\= Expect no match + B +No match + C +No match + D +No match + +/[[AC]~~[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + B + 0: B +\= Expect no match + C +No match + D +No match + /[A[]]]/B,alt_extended_class ------------------------------------------------------------------ Bra @@ -20600,9 +20707,33 @@ Failed: error 208 at offset 5: invalid operator in character class /[a|||b]/alt_extended_class Failed: error 208 at offset 5: invalid operator in character class +/[a~~~b]/alt_extended_class +Failed: error 208 at offset 5: invalid operator in character class + +/[a~~~~b]/alt_extended_class +Failed: error 208 at offset 6: invalid operator in character class + +/[a~~/alt_extended_class +Failed: error 106 at offset 4: missing terminating ] for character class + +/[a~~~/alt_extended_class +Failed: error 208 at offset 5: invalid operator in character class + +/[a~~~~/alt_extended_class +Failed: error 208 at offset 6: invalid operator in character class + /[a||b&&c]/alt_extended_class Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class +/[a||b~~c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + +/[a~~b&&c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + +/[a--b~~c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + /[a--b&&c]/alt_extended_class Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class @@ -20646,10 +20777,183 @@ Failed: error 150 at offset 5: invalid range in character class No match /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[z]&&a]]]]]]]]]]]]]]]/alt_extended_class -Failed: error 207 at offset 115: character classes are too deeply nested +Failed: error 207 at offset 115: character class nesting is too deep /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a&&a[z]]]]]]]]]]]]]]]]/alt_extended_class -Failed: error 207 at offset 118: character classes are too deeply nested +Failed: error 207 at offset 118: character class nesting is too deep + +/[z&/alt_extended_class +Failed: error 106 at offset 3: missing terminating ] for character class + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +# allow-empty-class does nothing inside (?[...]) +/(?[ []] ])/B,allow_empty_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\]] + ] + Ket + End +------------------------------------------------------------------ + ] + 0: ] + +# bad-escape-is-literal does nothing inside (?[...]) +/[ \j ]/ +Failed: error 103 at offset 3: unrecognized character follows \ + +/[ /\ +Failed: error 101 at offset 3: \ at end of pattern + +/(?[ \j ])/ +Failed: error 103 at offset 5: unrecognized character follows \ + +/(?[ /\ +Failed: error 101 at offset 5: \ at end of pattern + +/[ \j ]/bad_escape_is_literal + j + 0: j +\= Expect no match + k +No match + +/[ /\bad_escape_is_literal +Failed: error 106 at offset 3: missing terminating ] for character class + +/(?[ \j ])/bad_escape_is_literal +Failed: error 103 at offset 5: unrecognized character follows \ + +/(?[ /\bad_escape_is_literal +Failed: error 101 at offset 5: \ at end of pattern + +/(?[ [\j] ])/bad_escape_is_literal +Failed: error 103 at offset 6: unrecognized character follows \ + +/(?[ (\j) ])/bad_escape_is_literal +Failed: error 103 at offset 6: unrecognized character follows \ + +# We can't test error cases in testinput1 + +/(?[])/ +Failed: error 214 at offset 4: empty expression in character class + +/(?[/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/(?[]/ +Failed: error 214 at offset 4: empty expression in character class + +/(?[\n/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/(?[\n]/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[\n]z)/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[\n] )/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[(/ +Failed: error 114 at offset 4: missing closing parenthesis + +/(?[( / +Failed: error 106 at offset 5: missing terminating ] for character class + +/(?[(\n/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/(?[ \n + () ])/ +Failed: error 214 at offset 11: empty expression in character class + +/(?[1])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[a])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[a-c])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[(])/ +Failed: error 114 at offset 4: missing closing parenthesis + +/(?[(\n])/ +Failed: error 114 at offset 6: missing closing parenthesis + +/(?[\n)])/ +Failed: error 122 at offset 6: unmatched closing parenthesis + +/(?[^\n])/ +Failed: error 209 at offset 4: unexpected operator in character class (no preceding operand) + +/(?[ \n \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[ \d \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[ [\n] \t ])/ +Failed: error 213 at offset 11: unexpected expression in character class (no preceding operator) + +/(?[ (\n) \t ])/ +Failed: error 213 at offset 11: unexpected expression in character class (no preceding operator) + +/(?[ [:alpha:] \t ])/ +Failed: error 213 at offset 16: unexpected expression in character class (no preceding operator) + +/(?[ \n + \t \d ])/ +Failed: error 213 at offset 14: unexpected expression in character class (no preceding operator) + +/(?[ !\n \t ])/ +Failed: error 213 at offset 10: unexpected expression in character class (no preceding operator) + +/(?[ \n [:alpha:] ])/ +Failed: error 213 at offset 16: unexpected expression in character class (no preceding operator) + +/(?[ \n [\d] ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n (\t) ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n !\t ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[:graph:])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[\Qn\E])/ +Failed: error 216 at offset 6: unexpected character in (?[...]) character class + +# maximum depth tests + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&\n))))))))))))))])/ + \n + 0: \x0a +\= Expect no match + a +No match + b +No match + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+([\n]&\n))))))))))))))])/ +Failed: error 207 at offset 158: character class nesting is too deep + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&[\n]))))))))))))))])/ +Failed: error 207 at offset 161: character class nesting is too deep + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+((\n)&\n))))))))))))))])/ +Failed: error 207 at offset 158: character class nesting is too deep # -------------- @@ -20677,6 +20981,62 @@ Failed: error 150 at offset 11: invalid range in character class /[\d-z]/ Failed: error 150 at offset 4: invalid range in character class +/[\d-\w]/ +Failed: error 150 at offset 4: invalid range in character class + +/[\Q/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/[\Q/\ +Failed: error 106 at offset 4: missing terminating ] for character class + +/[\Q\E/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[\Q\n/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[\Q\n]/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[\Q\n/\ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[\Q\n\]/ +Failed: error 106 at offset 7: missing terminating ] for character class + +/[\Q\n\E/ +Failed: error 106 at offset 7: missing terminating ] for character class + +/[\Q\n\E]/ + \\ + 0: \ + n + 0: n +\= Expect no match + \n +No match + Q +No match + +/[z\Q/ +Failed: error 106 at offset 4: missing terminating ] for character class + +/[z\Q/\ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[z\Q\E/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[/\ +Failed: error 101 at offset 2: \ at end of pattern + +/[\n/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/[\E/ +Failed: error 106 at offset 3: missing terminating ] for character class + # End of testinput2 Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -62: bad serialized data diff --git a/testdata/testoutput4 b/testdata/testoutput4 index a23250314..fcde19f12 100644 --- a/testdata/testoutput4 +++ b/testdata/testoutput4 @@ -4722,4 +4722,144 @@ No match \x{3a3} No match +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\p{L} - \p{Lu}])/ + a + 0: a +\= Expect no match + A +No match + 1 +No match + +/(?[\p{L} & \p{Lu}])/ + A + 0: A +\= Expect no match + a +No match + 1 +No match + +/(?[[\p{Lu}z] ^ [\p{Ll}G]])/ + A + 0: A + p + 0: p +\= Expect no match + G +No match + z +No match + 1 +No match + +/(?[\p{Ll} | \p{Nd}])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + A +No match + +/(?[\p{Ll} + [\p{Nd}]])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + A +No match + +/(?[ ![\p{Nd}z] ])/ + _ + 0: _ + Z + 0: Z +\= Expect no match + 1 +No match + z +No match + +/(?[ \P{Nd} + [2] ])/ + _ + 0: _ + Z + 0: Z + 2 + 0: 2 +\= Expect no match + 1 +No match + 3 +No match + +/(?[ ![\P{Nd}] ])/ + 1 + 0: 1 + 2 + 0: 2 +\= Expect no match + _ +No match + z +No match + +# caseless tests + +/(?[ \p{Lu} ^ \p{Ll} ])/ + a + 0: a + A + 0: A +\= Expect no match + _ +No match + 1 +No match + +/(?[ [\p{Lu}1] ^ \p{Ll} ])/i + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + 2 +No match + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/i + a + 0: a + A + 0: A + 1 + 0: 1 +\= Expect no match + _ +No match + 2 +No match + +# -------------- + # End of testinput4 diff --git a/testdata/testoutput5 b/testdata/testoutput5 index 00743f193..29dbd1678 100644 --- a/testdata/testoutput5 +++ b/testdata/testoutput5 @@ -6351,6 +6351,90 @@ No match 0 0: 0 +/[[\p{Lu}\p{Ll}]||[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: || + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + 1 + 0: 1 + c + 0: c +\= Expect no match + _ +No match + +/[[\p{Lu}\p{Ll}]&&[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: && + ] + Ket + End +------------------------------------------------------------------ + c + 0: c +\= Expect no match + A +No match + 1 +No match + _ +No match + +/[[\p{Lu}\p{Ll}]--[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + A + 0: A +\= Expect no match + 1 +No match + c +No match + _ +No match + +/[[\p{Lu}\p{Ll}]~~[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + 1 + 0: 1 +\= Expect no match + c +No match + _ +No match + /[\pL[]]]/B,alt_extended_class ------------------------------------------------------------------ Bra @@ -7158,7 +7242,19 @@ No match n_n No match -/[\p{Nd}||[\pL--\p{Lu}]]/alt_extended_class +/[\p{Nd}||[\pL--\p{Lu}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Nd}] + cls:[\p{L}] + cls:[\p{Lu}] + op: -- + op: || + ] + Ket + End +------------------------------------------------------------------ a 0: a 0 @@ -7167,6 +7263,139 @@ No match C No match +/[\P{Nd}||2]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\P{Nd}] + cls:[2] + op: || + ] + Ket + End +------------------------------------------------------------------ + _ + 0: _ + Z + 0: Z + 2 + 0: 2 +\= Expect no match + 1 +No match + 3 +No match + +/[^[\P{Nd}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\P{Nd}] + op: ^ + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 + 2 + 0: 2 +\= Expect no match + _ +No match + z +No match + +# caseless tests + +/[\p{Lu}~~\p{Ll}]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}] + cls:[\p{Ll}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + a + 0: a + A + 0: A +\= Expect no match + _ +No match + 1 +No match + +/[[\p{Lu}1]~~\p{Ll}]/iB,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + cls:[\p{Lc}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + +/[[\p{Lu}1]&&[\p{Ll}1]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Z\xc0-\xd6\xd8-\xde\p{Lu}] + cls:[1a-z\xb5\xdf-\xf6\xf8-\xff\p{Ll}] + op: && + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + 2 +No match + +/[[\p{Lu}1]&&[\p{Ll}1]]/iB,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + op: && + ] + Ket + End +------------------------------------------------------------------ + a + 0: a + A + 0: A + 1 + 0: 1 +\= Expect no match + _ +No match + 2 +No match + /[[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] diff --git a/testdata/testoutput6 b/testdata/testoutput6 index 757e49db6..74c40e904 100644 --- a/testdata/testoutput6 +++ b/testdata/testoutput6 @@ -8149,4 +8149,19 @@ No match # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[A]+[B]])/ + A + 0: A + B + 0: B +\= Expect no match + [ +No match + ] +No match + +# -------------- + # End of testinput6 diff --git a/testdata/testoutput7 b/testdata/testoutput7 index eb27100d2..caf9eb25f 100644 --- a/testdata/testoutput7 +++ b/testdata/testoutput7 @@ -4462,6 +4462,91 @@ No match >Abb\x{01c5}< No match +# -------------- + +# EXTENDED CHARACTER CLASSES + +/[\p{Ll}[\p{Nd}]]C/alt_extended_class + aC + 0: aC + 1C + 0: 1C +\= Expect no match + [C +No match + +/[[\p{Ll}][\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 +\= Expect no match + [ +No match + ] +No match + +/[[\p{Ll}]||[\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 +\= Expect no match + C +No match + +/[[^\p{Ll}][\p{Nd}]]/alt_extended_class + 1 + 0: 1 + A + 0: A +\= Expect no match + a +No match + +/[^[\p{Ll}][\p{Nd}]]/alt_extended_class + A + 0: A +\= Expect no match + a +No match + 1 +No match + +/[^[\p{Ll}]&&[\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 + A + 0: A + +/(?[[\p{Ll}]+[\p{Nd}]])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + [ +No match + ] +No match + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[\p{Ll}Z]&[\p{Lu}a]])/ + a + 0: a + Z + 0: Z +\= Expect no match + A +No match + z +No match + # -------------------------------------------------------------------------- # End of testinput7