From 70c1a07c96e822103fd0a7fc49e0f025c20bab22 Mon Sep 17 00:00:00 2001
From: Nicholas Wilson
@@ -1555,38 +1556,82 @@ pcre2pattern man page
PCRE2 REGULAR EXPRESSION DETAILS
pcre2pattern man page
the next two sections), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
-The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])" -syntax, allowing instead extended class behaviour inside ordinary "[...]" -character classes. This altered syntax for "[...]" classes is loosely described -by the Unicode standard UTS#18. +PCRE2 supports Perl's (?[...]) extended character class syntax. This can +be used to perform set operations, such intersection.
-Firstly, in Perl syntax, an expression such as "[a[]" is a character class -with two literal characters "a" and "[", but in UTS#18 extended classes the "[" -character becomes an additional metacharacter within classes, denoting the start -of a nested class, so a literal "[" must be escaped as "\[". +The syntax permitted within (?[...]) is quite different to ordinary character +classes. Inside the extended class, there is an expression syntax consisting of +"atoms", operators, and ordinary parentheses "()" used for grouping.
-Secondly, within the UTS#18 extended syntax, there are additional operators -"||", "&&" and "--" which denote character class union, intersection, and -subtraction respectively. In standard Perl syntax, these would simply be -needlessly-repeated literals (except for "-" which can denote a range). These -operators can be used in constructs such as "[\p{L}--[QW]]" for "Unicode -letters, other than Q and W". A literal "-" at the end of a range must be -escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it -must be escaped as "[\--1]" in UTS#18 extended classes). +The allowed atoms are any escaped characters or sets such as \n or \d, POSIX +classes such as [:alpha:], and any ordinary character class may be nested as an +atom within an extended class. For example, in (?[\d & [...]]) the nested +ordinary class [...] follows the ordinary rules for character classes, in which +parentheses are not metacharacters, and character literals and ranges are +permitted. +
++However, when outside a nested ordinary character class, such as in +(?[... + (...)]), character literals and ranges may not be used, as they are not +atoms in the extended syntax. The extended syntax does not introduce any +additional escape sequences, so (?[\y]) is an unknown escape, as it would be +inside [\y]. +
++In the extended syntax, ^ does not negate a class (except within an +ordinary class nested inside an extended class); it is instead a binary +operator. +
++The binary operators are "&" (intersection), "|" or "+" (union), "-" +(subtraction) and "^" (symmetric difference). These are left-associative and +"&" has higher (tighter) precedence, while the others have equal lower +precedence. The one prefix unary operator is "!" (complement), with highest +precedence.
-The specific rules in PCRE2 are that classes can be nested: -"[...[B]...[^C]...]". The individual class items (literal characters, literal +A Perl extended character class always has the /xx modifier turned on within +it. +
++The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...]) +syntax, allowing instead extended class behaviour inside ordinary [...] +character classes. This altered syntax for [...] classes is loosely described +by the Unicode standard UTS#18. (The PCRE2_ALT_EXTENDED_CLASS option does not +prevent use of (?[...]) classes, but only changes the meaning of [...] +classes.) +
++Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a +character class with two literal characters "a" and "[", but in UTS#18 extended +classes the "[" character becomes an additional metacharacter within classes, +denoting the start of a nested class, so a literal "[" must be escaped as "\[". +
++Secondly, within the UTS#18 extended syntax, there are additional operators +"||", "&&", "--" and "~~" which denote character class union, intersection, +subtraction, and symmetric difference respectively. In standard Perl syntax, +these would simply be needlessly-repeated literals (except for "-" which can +denote a range). These operators can be used in constructs such as +[\p{L}--[QW]] for "Unicode letters, other than Q and W". A literal "-" at +the end of a range must be escaped (so while "[--1]" in Perl syntax is the +range from hyphen to "1", it must be escaped as "[\--1]" in UTS#18 extended +classes). +
++The specific rules in PCRE2 are: Classes can be nested, for example +[...[B]...[^C]...]. The individual class items (literal characters, literal ranges, properties such as \d or \p{...}, and nested classes) can be -combined by juxtaposition or by an operator "||", "&&", or "--". +combined by juxtaposition or by an operator "||", "&&", "--", or "~~". Juxtaposition is the implicit union operator, and binds more tightly than any explicit operator. Precedence between the explicit operators is not defined, -so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but -"[A&&[B--C]]" is accepted). +so mixing operators is a syntax error (thus [A&&B--C] is an error, but +[A&&[B--C]] is accepted).
This is an emerging syntax which is being adopted gradually across the regex @@ -1595,8 +1640,8 @@
PCRE2's syntax adds syntax restrictions similar to ECMASCript's /v flag, so @@ -1605,7 +1650,7 @@
Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports @@ -1709,7 +1754,7 @@
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of @@ -1730,7 +1775,7 @@
Vertical bar characters are used to separate alternative patterns. For example, the pattern @@ -1745,7 +1790,7 @@
The settings of several options can be changed within a pattern by a sequence of letters enclosed between "(?" and ")". The following are Perl-compatible, @@ -1843,7 +1888,7 @@
Groups are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a group does two things: @@ -1899,7 +1944,7 @@
Perl 5.10 introduced a feature whereby each alternative in a group uses the same numbers for its capturing parentheses. Such a group starts with (?| and is @@ -1945,7 +1990,7 @@
Identifying capture groups by number is simple, but it can be very hard to keep track of the numbers in complicated patterns. Furthermore, if an expression is @@ -2065,7 +2110,7 @@
Repetition is specified by quantifiers, which may follow any one of these items: @@ -2247,7 +2292,7 @@
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be @@ -2358,7 +2403,7 @@
Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a backreference to a capture group earlier (that @@ -2496,7 +2541,7 @@
An assertion is a test that does not consume any characters. The test must succeed for the match to continue. The simple assertions coded as \b, \B, @@ -2746,7 +2791,7 @@
Traditional lookaround assertions are atomic. That is, if an assertion is true, but there is a subsequent matching failure, there is no backtracking into the @@ -2806,7 +2851,7 @@
A special kind of assertion, not compatible with Perl, makes it possible to check the contents of a captured substring by matching it with a subpattern. @@ -2865,7 +2910,7 @@
In concept, a script run is a sequence of characters that are all from the same Unicode script such as Latin or Greek. However, because some scripts are @@ -2927,7 +2972,7 @@
It is possible to cause the matching process to obey a pattern fragment conditionally or to choose between two alternative fragments, depending on @@ -3128,7 +3173,7 @@
There are two ways of including comments in patterns that are processed by PCRE2. In both cases, the start of the comment must not be in a character @@ -3158,7 +3203,7 @@
Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can @@ -3346,7 +3391,7 @@
If the syntax for a recursive group call (either by number or by name) is used outside the parentheses to which it refers, it operates a bit like a subroutine @@ -3394,7 +3439,7 @@
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative @@ -3412,7 +3457,7 @@
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -3490,7 +3535,7 @@
There are a number of special "Backtracking Control Verbs" (to use Perl's terminology) that modify the behaviour of backtracking during matching. They @@ -4017,12 +4062,12 @@
pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3).
-
Philip Hazel
@@ -4031,9 +4076,9 @@
-Last updated: 21 September 2024
+Last updated: 08 November 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index 91386ad97..a6d89b94e 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -24,29 +24,30 @@
@@ -311,7 +312,45 @@
+When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be +used, allowing nested character classes, combined using set operators. +
+ [x&&[^y]] UTS#18 extended character class + + x||y set union (OR) + x&&y set intersection (AND) + x--y set difference (AND NOT) + x~~y set symmetric difference (XOR) + ++ +
+
+ (?[...]) Perl extended character class + (?[\p{Thai} & \p{Nd}]) operators; whitespace ignored + (?[(x - y) & z]) parentheses for grouping + + (?[ [^3] & \p{Nd} ]) [...] is a nested ordinary class + (?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...] + (?[ \d - [3] ]) backslash-escaped set is allowed outside [...] + (?[ !\n & [:ascii:] ]) backslash-escaped character is allowed outside [...] + all other characters or ranges must be enclosed in [...] + + x|y, x+y set union (OR) + x&y set intersection (AND) + x-y set difference (AND NOT) + x^y set symmetric difference (XOR) + !x set complement (NOT) ++Inside a Perl extended character class, [...] switches mode to be interpreted +as an ordinary character class. Outside of a nested [...], the only items +permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside +a nested ordinary class, ^ has its usual meaning (inverts the class when used +as the first character); outside of a nested class, ^ is the XOR operator. + +
? 0 or 1, greedy @@ -335,7 +374,7 @@-pcre2syntax man page
{,m}? zero up to m, lazy
\b word boundary @@ -353,7 +392,7 @@-pcre2syntax man page
\G first matching position in subject
\K set reported start of match @@ -363,13 +402,13 @@-pcre2syntax man page
option is set, the previous behaviour is re-enabled. When this option is set, \K is honoured in positive assertions, but ignored in negative ones. -
ALTERNATION
+
ALTERNATION
expr|expr|expr...-
CAPTURING
+
CAPTURING
(...) capture group @@ -384,20 +423,20 @@-pcre2syntax man page
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In both cases, a name must not start with a digit. -
ATOMIC GROUPS
+
ATOMIC GROUPS
(?>...) atomic non-capture group (*atomic:...) atomic non-capture group-
COMMENT
+
COMMENT
(?#....) comment (not nestable)-
OPTION SETTING
+
OPTION SETTING
Changes of these options within a group are automatically cancelled at the end of the group. @@ -456,7 +495,7 @@
pcre2syntax man page
application can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. -
NEWLINE CONVENTION
+
NEWLINE CONVENTION
These are recognized only at the very start of the pattern or after option settings with a similar syntax. @@ -469,7 +508,7 @@
pcre2syntax man page
(*NUL) the NUL character (binary zero)
WHAT \R MATCHES
+
WHAT \R MATCHES
These are recognized only at the very start of the pattern or after option setting with a similar syntax. @@ -478,7 +517,7 @@
pcre2syntax man page
(*BSR_UNICODE) any Unicode newline sequence
(?=...) ) @@ -504,7 +543,7 @@pcre2syntax man page
(ultimate default 255). If every branch matches a fixed number of characters, the limit for each branch is 65535 characters. -
NON-ATOMIC LOOKAROUND ASSERTIONS
+
NON-ATOMIC LOOKAROUND ASSERTIONS
These assertions are specific to PCRE2 and are not Perl-compatible.
@@ -517,7 +556,7 @@-pcre2syntax man page
(*non_atomic_positive_lookbehind:...) )
SUBSTRING SCAN ASSERTION
+
SUBSTRING SCAN ASSERTION
This feature is not Perl-compatible.
@@ -534,7 +573,7 @@-pcre2syntax man page
SCRIPT RUNS
+
SCRIPT RUNS
(*script_run:...) ) script run, can be backtracked into @@ -544,7 +583,7 @@-pcre2syntax man page
(*asr:...) )
BACKREFERENCES
+
BACKREFERENCES
\n reference by number (can be ambiguous) @@ -561,7 +600,7 @@-pcre2syntax man page
(?P=name) reference by name (Python)
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
+
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
(?R) recurse whole pattern @@ -580,7 +619,7 @@-pcre2syntax man page
\g'-n' call subroutine by relative number (PCRE2 extension)
CONDITIONAL PATTERNS
+
CONDITIONAL PATTERNS
(?(condition)yes-pattern) @@ -603,7 +642,7 @@pcre2syntax man page
conditions or recursion tests. Such a condition is interpreted as a reference condition if the relevant named group exists. -
BACKTRACKING CONTROL
+
BACKTRACKING CONTROL
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the name is mandatory, for the others it is optional. (*SKIP) changes its behaviour @@ -630,7 +669,7 @@
pcre2syntax man page
The effect of one of these verbs in a group called as a subroutine is confined to the subroutine call. -
CALLOUTS
+
CALLOUTS
(?C) callout (assumed number 0) @@ -641,7 +680,7 @@pcre2syntax man page
start and the end), and the starting delimiter { matched with the ending delimiter }. To encode the ending delimiter within the string, double it. -
REPLACEMENT STRINGS
+
REPLACEMENT STRINGS
If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for pcre2_substitute() is not interpreted. Otherwise, by default, the only @@ -687,12 +726,12 @@
pcre2syntax man page
The substitution strings themselves are expanded. Backslash can be used to escape colons and closing curly brackets. -
SEE ALSO
+
SEE ALSO
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2(3).
-
AUTHOR
+
AUTHOR
Philip Hazel
@@ -701,9 +740,9 @@pcre2syntax man page
Cambridge, England.
-
REVISION
+
REVISION
-Last updated: 20 October 2024 +Last updated: 08 November 2024
Copyright © 1997-2024 University of Cambridge.
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index d17ce1f9b..b61246eb0 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "21 Sepbember 2024" "PCRE2 10.45" +.TH PCRE2PATTERN 3 "08 November 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -1547,35 +1547,74 @@ the next two sections), and the terminating closing square bracket. However, escaping other non-alphanumeric characters does no harm. . . -.SH "UTS#18 EXTENDED CHARACTER CLASSES" +.SH "PERL EXTENDED CHARACTER CLASSES" .rs -The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])" -syntax, allowing instead extended class behaviour inside ordinary "[...]" -character classes. This altered syntax for "[...]" classes is loosely described -by the Unicode standard UTS#18. +PCRE2 supports Perl's (?[...]) extended character class syntax. This can +be used to perform set operations, such intersection. +.P +The syntax permitted within (?[...]) is quite different to ordinary character +classes. Inside the extended class, there is an expression syntax consisting of +"atoms", operators, and ordinary parentheses "()" used for grouping. +.P +The allowed atoms are any escaped characters or sets such as \en or \ed, POSIX +classes such as [:alpha:], and any ordinary character class may be nested as an +atom within an extended class. For example, in (?[\ed & [...]]) the nested +ordinary class [...] follows the ordinary rules for character classes, in which +parentheses are not metacharacters, and character literals and ranges are +permitted. +.P +However, when outside a nested ordinary character class, such as in +(?[... + (...)]), character literals and ranges may not be used, as they are not +atoms in the extended syntax. The extended syntax does not introduce any +additional escape sequences, so (?[\ey]) is an unknown escape, as it would be +inside [\ey]. +.P +In the extended syntax, ^ does not negate a class (except within an +ordinary class nested inside an extended class); it is instead a binary +operator. .P -Firstly, in Perl syntax, an expression such as "[a[]" is a character class -with two literal characters "a" and "[", but in UTS#18 extended classes the "[" -character becomes an additional metacharacter within classes, denoting the start -of a nested class, so a literal "[" must be escaped as "\e[". +The binary operators are "&" (intersection), "|" or "+" (union), "-" +(subtraction) and "^" (symmetric difference). These are left-associative and +"&" has higher (tighter) precedence, while the others have equal lower +precedence. The one prefix unary operator is "!" (complement), with highest +precedence. +.P +A Perl extended character class always has the /xx modifier turned on within +it. +. +. +.SH "UTS#18 EXTENDED CHARACTER CLASSES" +.rs +The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...]) +syntax, allowing instead extended class behaviour inside ordinary [...] +character classes. This altered syntax for [...] classes is loosely described +by the Unicode standard UTS#18. (The PCRE2_ALT_EXTENDED_CLASS option does not +prevent use of (?[...]) classes, but only changes the meaning of [...] +classes.) +.P +Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a +character class with two literal characters "a" and "[", but in UTS#18 extended +classes the "[" character becomes an additional metacharacter within classes, +denoting the start of a nested class, so a literal "[" must be escaped as "\e[". .P Secondly, within the UTS#18 extended syntax, there are additional operators -"||", "&&" and "--" which denote character class union, intersection, and -subtraction respectively. In standard Perl syntax, these would simply be -needlessly-repeated literals (except for "-" which can denote a range). These -operators can be used in constructs such as "[\ep{L}--[QW]]" for "Unicode -letters, other than Q and W". A literal "-" at the end of a range must be -escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it -must be escaped as "[\e--1]" in UTS#18 extended classes). -.P -The specific rules in PCRE2 are that classes can be nested: -"[...[B]...[^C]...]". The individual class items (literal characters, literal +"||", "&&", "--" and "~~" which denote character class union, intersection, +subtraction, and symmetric difference respectively. In standard Perl syntax, +these would simply be needlessly-repeated literals (except for "-" which can +denote a range). These operators can be used in constructs such as +[\ep{L}--[QW]] for "Unicode letters, other than Q and W". A literal "-" at +the end of a range must be escaped (so while "[--1]" in Perl syntax is the +range from hyphen to "1", it must be escaped as "[\e--1]" in UTS#18 extended +classes). +.P +The specific rules in PCRE2 are: Classes can be nested, for example +[...[B]...[^C]...]. The individual class items (literal characters, literal ranges, properties such as \ed or \ep{...}, and nested classes) can be -combined by juxtaposition or by an operator "||", "&&", or "--". +combined by juxtaposition or by an operator "||", "&&", "--", or "~~". Juxtaposition is the implicit union operator, and binds more tightly than any explicit operator. Precedence between the explicit operators is not defined, -so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but -"[A&&[B--C]]" is accepted). +so mixing operators is a syntax error (thus [A&&B--C] is an error, but +[A&&[B--C]] is accepted). .P This is an emerging syntax which is being adopted gradually across the regex ecosystem: for example JavaScript adopted the "/v" flag in ECMAScript 2024; @@ -1583,8 +1622,8 @@ Python's "re" module reserves the syntax for future use with a FutureWarning for unescaped use of "[" as a literal within character classes. Due to UTS#18 providing insufficient guidance, engines interpret the syntax differently. Rust's "regex" crate and Python's "regex" PyPi module both implement UTS#18 -extended classes, but with slight incompatibilities ("[A||B&&C]" is parsed as -"[A||[B&&C]]" in Python's "regex" but as "[[A||B]&&C]" in Rust's "regex"). +extended classes, but with slight incompatibilities ([A||B&&C] is parsed as +[A||[B&&C]] in Python's "regex" but as [[A||B]&&C] in Rust's "regex"). .P PCRE2's syntax adds syntax restrictions similar to ECMASCript's /v flag, so that all the extended classes accepted as valid by PCRE2 have the property @@ -4075,6 +4114,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 21 September 2024 +Last updated: 08 November 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index 4e769f1fb..a1a685dc7 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "20 October 2024" "PCRE2 10.45" +.TH PCRE2SYNTAX 3 "08 November 2024" "PCRE2 10.45" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -284,6 +284,43 @@ The recognized classes are: In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \eQ...\eE inside a character class. +.P +When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be +used, allowing nested character classes, combined using set operators. +.sp + [x&&[^y]] UTS#18 extended character class +.sp + x||y set union (OR) + x&&y set intersection (AND) + x--y set difference (AND NOT) + x~~y set symmetric difference (XOR) +.sp +. +. +.SH "PERL EXTENDED CHARACTER CLASSES" +.rs +.sp + (?[...]) Perl extended character class + (?[\ep{Thai} & \ep{Nd}]) operators; whitespace ignored + (?[(x - y) & z]) parentheses for grouping +.sp + (?[ [^3] & \ep{Nd} ]) [...] is a nested ordinary class + (?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...] + (?[ \ed - [3] ]) backslash-escaped set is allowed outside [...] + (?[ !\en & [:ascii:] ]) backslash-escaped character is allowed outside [...] + all other characters or ranges must be enclosed in [...] +.sp + x|y, x+y set union (OR) + x&y set intersection (AND) + x-y set difference (AND NOT) + x^y set symmetric difference (XOR) + !x set complement (NOT) +.sp +Inside a Perl extended character class, [...] switches mode to be interpreted +as an ordinary character class. Outside of a nested [...], the only items +permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside +a nested ordinary class, ^ has its usual meaning (inverts the class when used +as the first character); outside of a nested class, ^ is the XOR operator. . . .SH "QUANTIFIERS" @@ -692,6 +729,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 20 October 2024 +Last updated: 08 November 2024 Copyright (c) 1997-2024 University of Cambridge. .fi diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic index b4b461e93..af18d4218 100644 --- a/src/pcre2.h.generic +++ b/src/pcre2.h.generic @@ -339,6 +339,10 @@ pcre2_pattern_convert(). */ #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213 +#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214 +#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2.h.in b/src/pcre2.h.in index 89ccfae56..fff5f2cd1 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -339,6 +339,10 @@ pcre2_pattern_convert(). */ #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213 +#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214 +#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215 +#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index caf34a66f..b06f716a3 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -260,9 +260,11 @@ static unsigned char meta_extra_lengths[] = { 2, /* META_MINMAX */ 2, /* META_MINMAX_PLUS */ 2, /* META_MINMAX_QUERY */ - 0, /* META_ECLASS_OR */ 0, /* META_ECLASS_AND */ - 0 /* META_ECLASS_SUB */ + 0, /* META_ECLASS_OR */ + 0, /* META_ECLASS_SUB */ + 0, /* META_ECLASS_XOR */ + 0 /* META_ECLASS_NOT */ }; /* Types for skipping parts of a parsed pattern. */ @@ -1077,9 +1079,11 @@ for (;;) fprintf(stderr, ") length=%u", length); break; - case META_ECLASS_OR: fprintf(stderr, "META_ECLASS_OR"); break; case META_ECLASS_AND: fprintf(stderr, "META_ECLASS_AND"); break; + case META_ECLASS_OR: fprintf(stderr, "META_ECLASS_OR"); break; case META_ECLASS_SUB: fprintf(stderr, "META_ECLASS_SUB"); break; + case META_ECLASS_XOR: fprintf(stderr, "META_ECLASS_XOR"); break; + case META_ECLASS_NOT: fprintf(stderr, "META_ECLASS_NOT"); break; } fprintf(stderr, "\n"); } @@ -1273,7 +1277,7 @@ if (allow_sign >= 0 && sign != 0) } if (sign > 0) n += allow_sign; - else if ((int)n > allow_sign) + else if (n > (uint32_t)allow_sign) { *errorcodeptr = ERR15; /* Non-existent subpattern */ goto EXIT; @@ -2826,13 +2830,28 @@ enum { RANGE_STARTED, /* State after '[1-'; last-emitted code is META_RANGE_XYZ */ RANGE_FORBID_NO, /* State after '[\d'; '-]' is allowed but not '-1]' */ RANGE_FORBID_STARTED, /* State after '[\d-'*/ - RANGE_OK_ESCAPED, /* State after '[1'; hyphen may be a range */ - RANGE_OK_LITERAL /* State after '[\1'; hyphen may be a range */ + RANGE_OK_ESCAPED, /* State after '[\1'; hyphen may be a range */ + RANGE_OK_LITERAL /* State after '[1'; hyphen may be a range */ +}; + +/* States used for analyzing operators and operands in extended character +classes. */ + +enum { + CLASS_OP_EMPTY, /* At start of an expression; empty previous contents */ + CLASS_OP_OPERAND, /* Have preceding operand; after "z" a "--" can follow */ + CLASS_OP_OPERATOR /* Have preceding operator; after "--" operand must follow */ }; -/* States used for analyzing operators and operands in character classes. */ +/* States used for determining the parse mode in character classes. The two +PERL_EXT values must be last. */ -enum { CLASS_OP_NONE, CLASS_OP_OPERAND, CLASS_OP_OPERATOR }; +enum { + CLASS_MODE_NORMAL, /* Ordinary PCRE2 '[...]' class. */ + CLASS_MODE_ALT_EXT, /* UTS#18-style extended '[...]' class. */ + CLASS_MODE_PERL_EXT, /* Perl extended '(?[...])' class. */ + CLASS_MODE_PERL_EXT_LEAF /* Leaf within extended '(?[ [...] ])' class. */ +}; /* Only in 32-bit mode can there be literals > META_END. A macro encapsulates the storing of literal values in the main parsed pattern, where they can always @@ -2859,6 +2878,7 @@ uint32_t delimiter; uint32_t namelen; uint32_t class_range_state; uint32_t class_op_state; +uint32_t class_mode_state; uint32_t *class_start; uint32_t *verblengthptr = NULL; /* Value avoids compiler warning */ uint32_t *verbstartptr = NULL; @@ -3543,7 +3563,7 @@ while (ptr < ptrend) if (!prev_okquantifier) { errorcode = ERR9; - goto FAILED_BACK; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 } /* Most (*VERB)s are not allowed to be quantified, but an ungreedy @@ -3578,14 +3598,6 @@ while (ptr < ptrend) /* ---- Character class ---- */ case CHAR_LEFT_SQUARE_BRACKET: - okquantifier = TRUE; - - /* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/536 - We shall support Perl's (?[...]) syntax. We need a variable class_perlext = true - and a goto jumping here if we see "(?[...". We need to check for closing "])" and - also implement the completely idiosyncratic nesting and operator rules in this - mode. We can hopefully emit exactly the same META codes as for the UTS#18 - syntax, so that only parser changes are required for the Perl syntax. */ /* In another (POSIX) regex library, the ugly syntax [[:<:]] and [[:>:]] is used for "start of word" and "end of word". As these are otherwise illegal @@ -3623,6 +3635,7 @@ while (ptr < ptrend) } *parsed_pattern++ = META_KET; ptr += 6; + okquantifier = TRUE; break; } @@ -3637,6 +3650,15 @@ while (ptr < ptrend) goto FAILED; } + class_mode_state = ((options & PCRE2_ALT_EXTENDED_CLASS) != 0)? + CLASS_MODE_ALT_EXT : CLASS_MODE_NORMAL; + + /* Jump here from '(?[...])'. That jump must initialize class_mode_state, + set c to the '[' character, and ptr to just after the '['. */ + + FROM_PERL_EXTENDED_CLASS: + okquantifier = TRUE; + /* In an EBCDIC environment, Perl treats alphabetic ranges specially because there are holes in the encoding, and simply using the range A-Z (for example) would include the characters in the holes. This applies only @@ -3646,14 +3668,14 @@ while (ptr < ptrend) ranges. */ /* Loop for the contents of the class. Classes may be nested, if - PCRE2_ALT_EXTENDED_CLASS is set. */ + PCRE2_ALT_EXTENDED_CLASS is set, or the class is of the form (?[...]). */ /* c is still set to '[' so the loop will handle the start of the class. */ class_depth_m1 = -1; class_maxdepth_m1 = -1; class_range_state = RANGE_NO; - class_op_state = CLASS_OP_NONE; + class_op_state = CLASS_OP_EMPTY; class_start = NULL; for (;;) @@ -3670,13 +3692,26 @@ while (ptr < ptrend) ptr++; /* Skip the 'E' */ goto CLASS_CONTINUE; } + + /* Surprisingly, you cannot use \Q..\E to escape a character inside a + Perl extended class. However, empty \Q\E sequences are allowed, so here + were're only giving an error if the \Q..\E is non-empty. */ + + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR116; + goto FAILED; + } + goto CLASS_LITERAL; } - /* Skip over space and tab (only) in extended-more mode. */ + /* Skip over space and tab (only) in extended-more mode, or anywhere + inside a Perl extended class (which implies /xx). */ - if ((options & PCRE2_EXTENDED_MORE) != 0 && - (c == CHAR_SPACE || c == CHAR_HT)) + if ((c == CHAR_SPACE || c == CHAR_HT) && + ((options & PCRE2_EXTENDED_MORE) != 0 || + class_mode_state >= CLASS_MODE_PERL_EXT)) goto CLASS_CONTINUE; /* Handle POSIX class names. Perl allows a negation extension of the @@ -3722,6 +3757,16 @@ while (ptr < ptrend) goto FAILED; } + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + ptr = tempptr + 2; + errorcode = ERR113; + goto FAILED; + } + if (*ptr != CHAR_COLON) { ptr = tempptr + 2; @@ -3793,15 +3838,39 @@ while (ptr < ptrend) /* Check for the start of the outermost class, or the start of a nested class. */ - else if (c == CHAR_LEFT_SQUARE_BRACKET && - (class_depth_m1 < 0 || (options & PCRE2_ALT_EXTENDED_CLASS) != 0)) + else if ((c == CHAR_LEFT_SQUARE_BRACKET && + (class_depth_m1 < 0 || class_mode_state == CLASS_MODE_ALT_EXT || + class_mode_state == CLASS_MODE_PERL_EXT)) || + (c == CHAR_LEFT_PARENTHESIS && + class_mode_state == CLASS_MODE_PERL_EXT)) { + uint32_t start_c = c; + uint32_t new_class_mode_state; + + /* Update the class mode, if moving into a 'leaf' inside a Perl extended + class. */ + + if (start_c == CHAR_LEFT_SQUARE_BRACKET && + class_mode_state == CLASS_MODE_PERL_EXT && class_depth_m1 >= 0) + new_class_mode_state = CLASS_MODE_PERL_EXT_LEAF; + else + new_class_mode_state = class_mode_state; + /* Tidy up the other class before starting the nested class. */ /* -[ beginning a nested class is a literal '-' */ if (class_range_state == RANGE_STARTED) parsed_pattern[-1] = CHAR_MINUS; + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + /* Validate nesting depth */ if (class_depth_m1 >= ECLASS_NEST_LIMIT - 1) { @@ -3819,12 +3888,16 @@ while (ptr < ptrend) { if (ptr >= ptrend) { - errorcode = ERR6; /* Missing terminating ']' */ + if (start_c == CHAR_LEFT_PARENTHESIS) + errorcode = ERR14; /* Missing terminating ')' */ + else + errorcode = ERR6; /* Missing terminating ']' */ goto FAILED; } GETCHARINCTEST(c, ptr); - if (c == CHAR_BACKSLASH) + if (new_class_mode_state == CLASS_MODE_PERL_EXT) break; + else if (c == CHAR_BACKSLASH) { if (ptr < ptrend && *ptr == CHAR_E) ptr++; else if (ptrend - ptr >= 3 && @@ -3833,8 +3906,9 @@ while (ptr < ptrend) else break; } - else if ((options & PCRE2_EXTENDED_MORE) != 0 && - (c == CHAR_SPACE || c == CHAR_HT)) /* Note: just these two */ + else if ((c == CHAR_SPACE || c == CHAR_HT) && /* Note: just these two */ + ((options & PCRE2_EXTENDED_MORE) != 0 || + new_class_mode_state >= CLASS_MODE_PERL_EXT)) continue; else if (!negate_class && c == CHAR_CIRCUMFLEX_ACCENT) negate_class = TRUE; @@ -3842,11 +3916,15 @@ while (ptr < ptrend) } /* Now the real contents of the class; c has the first "real" character. - Empty classes are permitted only if the option is set. */ + Empty classes are permitted only if the option is set, and if it's not + a Perl-extended class. */ if (c == CHAR_RIGHT_SQUARE_BRACKET && - (cb->external_options & PCRE2_ALLOW_EMPTY_CLASS) != 0) + (cb->external_options & PCRE2_ALLOW_EMPTY_CLASS) != 0 && + new_class_mode_state < CLASS_MODE_PERL_EXT) { + PCRE2_ASSERT(start_c == CHAR_LEFT_SQUARE_BRACKET); + if (class_start != NULL) { PCRE2_ASSERT(class_depth_m1 >= 0); @@ -3879,7 +3957,8 @@ while (ptr < ptrend) class_start = parsed_pattern; *parsed_pattern++ = negate_class? META_CLASS_NOT : META_CLASS; class_range_state = RANGE_NO; - class_op_state = CLASS_OP_NONE; + class_op_state = CLASS_OP_EMPTY; + class_mode_state = new_class_mode_state; ++class_depth_m1; if (class_maxdepth_m1 < class_depth_m1) class_maxdepth_m1 = class_depth_m1; @@ -3887,7 +3966,8 @@ while (ptr < ptrend) cb->class_op_used[class_depth_m1] = 0; /* Implement the special start-of-class literal meaning of ']'. */ - if (c == CHAR_RIGHT_SQUARE_BRACKET) + if (c == CHAR_RIGHT_SQUARE_BRACKET && + new_class_mode_state != CLASS_MODE_PERL_EXT) { class_range_state = RANGE_OK_LITERAL; class_op_state = CLASS_OP_OPERAND; @@ -3900,8 +3980,25 @@ while (ptr < ptrend) /* Check for the end of the class. */ - else if (c == CHAR_RIGHT_SQUARE_BRACKET) + else if (c == CHAR_RIGHT_SQUARE_BRACKET || + (c == CHAR_RIGHT_PARENTHESIS && class_mode_state == CLASS_MODE_PERL_EXT)) { + /* In Perl extended mode, the ']' can only be used to match the + opening '[', and ')' must match an opening parenthesis. */ + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + if (c == CHAR_RIGHT_SQUARE_BRACKET && class_depth_m1 != 0) + { + errorcode = ERR14; + goto FAILED_BACK; + } + if (c == CHAR_RIGHT_PARENTHESIS && class_depth_m1 < 1) + { + errorcode = ERR22; + goto FAILED; + } + } + /* Check no trailing operator. */ if (class_op_state == CLASS_OP_OPERATOR) { @@ -3909,25 +4006,113 @@ while (ptr < ptrend) goto FAILED; } + /* Check no empty expression for Perl extended expressions. */ + if (class_mode_state == CLASS_MODE_PERL_EXT && + class_op_state == CLASS_OP_EMPTY) + { + errorcode = ERR114; + goto FAILED; + } + /* -] at the end of a class is a literal '-' */ if (class_range_state == RANGE_STARTED) parsed_pattern[-1] = CHAR_MINUS; *parsed_pattern++ = META_CLASS_END; - if (--class_depth_m1 < 0) break; + if (--class_depth_m1 < 0) + { + /* Check for and consume ')' after '(?[...]'. */ + PCRE2_ASSERT(class_mode_state != CLASS_MODE_PERL_EXT_LEAF); + if (class_mode_state == CLASS_MODE_PERL_EXT) + { + if (ptr >= ptrend || *ptr != CHAR_RIGHT_PARENTHESIS) + { + errorcode = ERR115; + goto FAILED; + } + + ptr++; + } + + break; + } class_range_state = RANGE_NO; /* for processing the containing class */ class_op_state = CLASS_OP_OPERAND; + if (class_mode_state == CLASS_MODE_PERL_EXT_LEAF) + class_mode_state = CLASS_MODE_PERL_EXT; /* The extended class flag has already been set for the parent class. */ class_start = NULL; } - /* Handle a set operator */ + /* Handle a Perl set binary operator */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT && + (c == CHAR_PLUS || c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || + c == CHAR_AMPERSAND || c == CHAR_CIRCUMFLEX_ACCENT)) + { + /* Check for a preceding operand. */ + if (class_op_state != CLASS_OP_OPERAND) + { + errorcode = ERR109; + goto FAILED; + } + + if (class_start != NULL) + { + PCRE2_ASSERT(class_depth_m1 >= 0); + /* Represents that the class is an extended class. */ + *class_start |= CLASS_IS_ECLASS; + class_start = NULL; + } + + PCRE2_ASSERT(class_range_state != RANGE_STARTED && + class_range_state != RANGE_FORBID_STARTED); + + *parsed_pattern++ = c == CHAR_PLUS? META_ECLASS_OR : + c == CHAR_VERTICAL_LINE? META_ECLASS_OR : + c == CHAR_MINUS? META_ECLASS_SUB : + c == CHAR_AMPERSAND? META_ECLASS_AND : + META_ECLASS_XOR; + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERATOR; + } + + /* Handle a Perl set unary operator */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT && + c == CHAR_EXCLAMATION_MARK) + { + /* Check for no preceding operand. */ + if (class_op_state == CLASS_OP_OPERAND) + { + errorcode = ERR113; + goto FAILED; + } + + if (class_start != NULL) + { + PCRE2_ASSERT(class_depth_m1 >= 0); + /* Represents that the class is an extended class. */ + *class_start |= CLASS_IS_ECLASS; + class_start = NULL; + } + + PCRE2_ASSERT(class_range_state != RANGE_STARTED && + class_range_state != RANGE_FORBID_STARTED); + + *parsed_pattern++ = META_ECLASS_NOT; + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERATOR; + } + + /* Handle a UTS#18 set operator */ - else if ((options & PCRE2_ALT_EXTENDED_CLASS) != 0 && - (c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || c == CHAR_AMPERSAND) && + else if (class_mode_state == CLASS_MODE_ALT_EXT && + (c == CHAR_VERTICAL_LINE || c == CHAR_MINUS || + c == CHAR_AMPERSAND || c == CHAR_TILDE) && ptr < ptrend && *ptr == c) { ++ptr; @@ -3969,71 +4154,16 @@ while (ptr < ptrend) *parsed_pattern++ = c == CHAR_VERTICAL_LINE? META_ECLASS_OR : c == CHAR_MINUS? META_ECLASS_SUB : - META_ECLASS_AND; + c == CHAR_AMPERSAND? META_ECLASS_AND : + META_ECLASS_XOR; class_range_state = RANGE_NO; class_op_state = CLASS_OP_OPERATOR; cb->class_op_used[class_depth_m1] = (uint8_t)c; } - /* Handle potential start of range */ - - else if (c == CHAR_MINUS && class_range_state >= RANGE_OK_ESCAPED) - { - *parsed_pattern++ = (class_range_state == RANGE_OK_LITERAL)? - META_RANGE_LITERAL : META_RANGE_ESCAPED; - class_range_state = RANGE_STARTED; - } - - /* Handle forbidden start of range */ - - else if (c == CHAR_MINUS && class_range_state == RANGE_FORBID_NO) - { - *parsed_pattern++ = CHAR_MINUS; - class_range_state = RANGE_FORBID_STARTED; - class_range_forbid_ptr = ptr; - } - - /* Handle a literal character */ - - else if (c != CHAR_BACKSLASH) - { - CLASS_LITERAL: - if (class_range_state == RANGE_STARTED) - { - if (c == parsed_pattern[-2]) /* Optimize one-char range */ - parsed_pattern--; - else if (parsed_pattern[-2] > c) /* Check range is in order */ - { - errorcode = ERR8; - goto FAILED_BACK; - } - else - { - if (!char_is_literal && parsed_pattern[-1] == META_RANGE_LITERAL) - parsed_pattern[-1] = META_RANGE_ESCAPED; - PARSED_LITERAL(c, parsed_pattern); - } - class_range_state = RANGE_NO; - class_op_state = CLASS_OP_OPERAND; - } - else if (class_range_state == RANGE_FORBID_STARTED) - { - ptr = class_range_forbid_ptr; - errorcode = ERR50; - goto FAILED; - } - else /* Potential start of range */ - { - class_range_state = char_is_literal? - RANGE_OK_LITERAL : RANGE_OK_ESCAPED; - class_op_state = CLASS_OP_OPERAND; - PARSED_LITERAL(c, parsed_pattern); - } - } - /* Handle escapes in a class */ - else + else if (c == CHAR_BACKSLASH) { tempptr = ptr; escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, options, @@ -4041,7 +4171,8 @@ while (ptr < ptrend) if (errorcode != 0) { - if ((xoptions & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0) + if ((xoptions & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) == 0 || + class_mode_state >= CLASS_MODE_PERL_EXT) goto FAILED; ptr = tempptr; if (ptr >= ptrend) c = CHAR_BACKSLASH; else @@ -4151,20 +4282,106 @@ while (ptr < ptrend) goto FAILED; } + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + class_range_state = RANGE_FORBID_NO; class_op_state = CLASS_OP_OPERAND; } + /* Forbid unescaped literals, and the special meaning of '-', inside a + Perl extended class. */ + + else if (class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR116; + goto FAILED; + } + + /* Handle potential start of range */ + + else if (c == CHAR_MINUS && class_range_state >= RANGE_OK_ESCAPED) + { + *parsed_pattern++ = (class_range_state == RANGE_OK_LITERAL)? + META_RANGE_LITERAL : META_RANGE_ESCAPED; + class_range_state = RANGE_STARTED; + } + + /* Handle forbidden start of range */ + + else if (c == CHAR_MINUS && class_range_state == RANGE_FORBID_NO) + { + *parsed_pattern++ = CHAR_MINUS; + class_range_state = RANGE_FORBID_STARTED; + class_range_forbid_ptr = ptr; + } + + /* Handle a literal character */ + + else + { + CLASS_LITERAL: + + /* Disallow implicit union in Perl extended classes. */ + + if (class_op_state == CLASS_OP_OPERAND && + class_mode_state == CLASS_MODE_PERL_EXT) + { + errorcode = ERR113; + goto FAILED; + } + + if (class_range_state == RANGE_STARTED) + { + if (c == parsed_pattern[-2]) /* Optimize one-char range */ + parsed_pattern--; + else if (parsed_pattern[-2] > c) /* Check range is in order */ + { + errorcode = ERR8; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 + } + else + { + if (!char_is_literal && parsed_pattern[-1] == META_RANGE_LITERAL) + parsed_pattern[-1] = META_RANGE_ESCAPED; + PARSED_LITERAL(c, parsed_pattern); + } + class_range_state = RANGE_NO; + class_op_state = CLASS_OP_OPERAND; + } + else if (class_range_state == RANGE_FORBID_STARTED) + { + ptr = class_range_forbid_ptr; + errorcode = ERR50; + goto FAILED; + } + else /* Potential start of range */ + { + class_range_state = char_is_literal? + RANGE_OK_LITERAL : RANGE_OK_ESCAPED; + class_op_state = CLASS_OP_OPERAND; + PARSED_LITERAL(c, parsed_pattern); + } + } + /* Proceed to next thing in the class. */ CLASS_CONTINUE: if (ptr >= ptrend) { - if ((options & PCRE2_ALT_EXTENDED_CLASS) != 0 && + if (class_mode_state == CLASS_MODE_PERL_EXT && class_depth_m1 > 0) + errorcode = ERR14; /* Missing terminating ')' */ + if (class_mode_state == CLASS_MODE_ALT_EXT && class_depth_m1 == 0 && class_maxdepth_m1 == 1) errorcode = ERR112; /* Missing terminating ']', but we saw '[ [ ]...' */ else - errorcode = ERR6; /* Missing terminating ']' */ + errorcode = ERR6; /* Missing terminating ']' */ goto FAILED; } GETCHARINCTEST(c, ptr); @@ -4296,6 +4513,7 @@ while (ptr < ptrend) if (read_number(&ptr, ptrend, cb->bracount, MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) { + PCRE2_ASSERT(i >= 0); if (i <= 0) { errorcode = ERR15; @@ -4743,11 +4961,7 @@ while (ptr < ptrend) (IS_DIGIT(*ptr))? -1:(int)(cb->bracount), /* + and - are relative */ MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) goto FAILED; - if (i < 0) /* NB (?0) is permitted */ - { - errorcode = ERR15; /* Unknown group */ - goto FAILED_BACK; - } + PCRE2_ASSERT(i >= 0); /* NB (?0) is permitted, represented by i=0 */ if (ptr >= ptrend || *ptr != CHAR_RIGHT_PARENTHESIS) goto UNCLOSED_PARENTHESIS; @@ -4939,6 +5153,7 @@ while (ptr < ptrend) if (read_number(&ptr, ptrend, cb->bracount, MAX_GROUP_NUMBER, ERR61, &i, &errorcode)) { + PCRE2_ASSERT(i >= 0); if (i <= 0) { errorcode = ERR15; @@ -5089,7 +5304,7 @@ while (ptr < ptrend) goto POST_ASSERTION; case CHAR_ASTERISK: - POSITIVE_NONATOMIC_LOOK_AHEAD: /* Come from (?* */ + POSITIVE_NONATOMIC_LOOK_AHEAD: /* Come from (*napla: */ *parsed_pattern++ = META_LOOKAHEAD_NA; ptr++; goto POST_ASSERTION; @@ -5254,6 +5469,18 @@ while (ptr < ptrend) cb->named_groups[cb->names_found].isdup = (uint16_t)isdupname; cb->names_found++; break; + + + /* ---- Perl extended character class ---- */ + + /* These are of the form '(?[...])'. We handle these via the same parser + that consumes ordinary '[...]' classes, but with a flag set to activate + the extended behaviour. */ + + case CHAR_LEFT_SQUARE_BRACKET: + class_mode_state = CLASS_MODE_PERL_EXT; + c = *ptr++; + goto FROM_PERL_EXTENDED_CLASS; } /* End of (? switch */ break; /* End of ( handling */ @@ -5300,7 +5527,7 @@ while (ptr < ptrend) if (nest_depth == 0) /* Unmatched closing parenthesis */ { errorcode = ERR22; - goto FAILED_BACK; + goto FAILED_BACK; // TODO https://github.com/PCRE2Project/pcre2/issues/549 } nest_depth--; *parsed_pattern++ = META_KET; diff --git a/src/pcre2_compile.h b/src/pcre2_compile.h index e94ace19c..57a72afe2 100644 --- a/src/pcre2_compile.h +++ b/src/pcre2_compile.h @@ -62,7 +62,7 @@ enum { ERR0 = COMPILE_ERROR_BASE, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98, ERR99, ERR100, ERR101,ERR102,ERR103,ERR104,ERR105,ERR106,ERR107,ERR108,ERR109,ERR110, - ERR111,ERR112 }; + ERR111,ERR112,ERR113,ERR114,ERR115,ERR116 }; /* Code values for parsed patterns, which are stored in a vector of 32-bit unsigned ints. Values less than META_END are literal data values. The coding @@ -158,11 +158,14 @@ versions. */ #define META_MINMAX_PLUS 0x80420000u /* {n,m}+ repeat */ #define META_MINMAX_QUERY 0x80430000u /* {n,m}? repeat */ -/* These meta codes have no ordering constraints. */ +/* These meta codes must be kept in a group, with the OR/SUB/XOR in +this order. */ -#define META_ECLASS_OR 0x80440000u /* || in a class */ -#define META_ECLASS_AND 0x80450000u /* && in a class */ -#define META_ECLASS_SUB 0x80460000u /* -- in a class */ +#define META_ECLASS_AND 0x80440000u /* && (or &) in a class */ +#define META_ECLASS_OR 0x80450000u /* || (or |, +) in a class */ +#define META_ECLASS_SUB 0x80460000u /* -- (or -) in a class */ +#define META_ECLASS_XOR 0x80470000u /* ~~ (or ^) in a class */ +#define META_ECLASS_NOT 0x80480000u /* ! in a class */ /* Convenience aliases. */ diff --git a/src/pcre2_compile_class.c b/src/pcre2_compile_class.c index 2896e3ca7..ca44983be 100644 --- a/src/pcre2_compile_class.c +++ b/src/pcre2_compile_class.c @@ -61,9 +61,11 @@ b) none of the cases here: case META_CLASS_EMPTY: \ case META_CLASS_EMPTY_NOT: \ case META_CLASS_END: \ - case META_ECLASS_OR: \ case META_ECLASS_AND: \ - case META_ECLASS_SUB: + case META_ECLASS_OR: \ + case META_ECLASS_SUB: \ + case META_ECLASS_XOR: \ + case META_ECLASS_NOT: #else #define CLASS_END_CASES(meta) \ default: @@ -1798,9 +1800,11 @@ while (TRUE) switch (META_CODE(*ptr)) { case META_CLASS_END: - case META_ECLASS_OR: case META_ECLASS_AND: + case META_ECLASS_OR: case META_ECLASS_SUB: + case META_ECLASS_XOR: + case META_ECLASS_NOT: goto DONE; case META_CLASS_EMPTY_NOT: @@ -1865,6 +1869,139 @@ return TRUE; +/* This function consumes unary prefix operators. */ + +static BOOL +compile_class_unary(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +BOOL negated = FALSE; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +while (*ptr == META_ECLASS_NOT) + { + ++ptr; + negated = !negated; + } + +/* Because it's a non-empty class, there must be an operand. */ +if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + +/* Convert prefix to postfix (RPN). */ +if (negated) + { + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = OP_ECLASS_NOT; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + +/* This function consumes tightly-binding binary operators. */ + +static BOOL +compile_class_binary_tight(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +/* Because it's a non-empty class, there must be an operand at the start. */ +if (!compile_class_unary(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + +while (*ptr == META_ECLASS_AND) + { + uint32_t op = OP_ECLASS_AND; + ++ptr; + + /* An operand must follow the operator. */ + if (!compile_class_unary(options, xoptions, &ptr, &code, errorcodeptr, cb, + lengthptr)) + return FALSE; + + /* Convert infix to postfix (RPN). */ + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = op; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + +/* This function consumes loosely-binding binary operators. */ + +static BOOL +compile_class_binary_loose(uint32_t options, uint32_t xoptions, uint32_t **pptr, + PCRE2_UCHAR **pcode, int *errorcodeptr, compile_block *cb, + PCRE2_SIZE *lengthptr) +{ +uint32_t *ptr = *pptr; +PCRE2_UCHAR *code = *pcode; +#ifdef PCRE2_DEBUG +PCRE2_UCHAR *start_code = *pcode; +#endif + +/* Because it's a non-empty class, there must be an operand at the start. */ +if (!compile_class_binary_tight(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) + return FALSE; + +while (*ptr >= META_ECLASS_OR && *ptr <= META_ECLASS_XOR) + { + uint32_t op = *ptr == META_ECLASS_OR ? OP_ECLASS_OR : + *ptr == META_ECLASS_SUB ? OP_ECLASS_SUB : + OP_ECLASS_XOR; + ++ptr; + + /* An operand must follow the operator. */ + if (!compile_class_binary_tight(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) + return FALSE; + + /* Convert infix to postfix (RPN). */ + if (lengthptr != NULL) + (*lengthptr)++; + else + *code++ = op; + } + +PCRE2_ASSERT(lengthptr == NULL || code == start_code); + +*pptr = ptr; +*pcode = code; +return TRUE; +} + + + /* This function converts the META codes in pptr into opcodes written to pcode. The pptr must start at a META_CLASS or META_CLASS_NOT. @@ -1898,29 +2035,10 @@ PCRE2_ASSERT(*ptr == (META_CLASS | CLASS_IS_ECLASS) || negated = *ptr++ == (META_CLASS_NOT | CLASS_IS_ECLASS); /* Because it's a non-empty class, there must be an operand at the start. */ -if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, - lengthptr)) +if (!compile_class_binary_loose(options, xoptions, &ptr, &code, errorcodeptr, + cb, lengthptr)) return FALSE; -while (*ptr >= META_ECLASS_OR && *ptr <= META_ECLASS_SUB) - { - uint32_t op = *ptr == META_ECLASS_OR ? OP_ECLASS_OR : - *ptr == META_ECLASS_AND ? OP_ECLASS_AND : - OP_ECLASS_SUB; - ++ptr; - - /* An operand must follow the operator. */ - if (!compile_class_operand(options, xoptions, &ptr, &code, errorcodeptr, cb, - lengthptr)) - return FALSE; - - /* Convert infix to postfix (RPN). */ - if (lengthptr != NULL) - (*lengthptr)++; - else - *code++ = op; - } - if (negated) { if (lengthptr != NULL) diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index a674bb7a8..aa1f58cc6 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -191,7 +191,7 @@ static const uint8_t coptable[] = { 0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */ 0, 0, /* \B and \b in UCP mode */ 0, /* ECLASS */ - 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ + 0, 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ }; /* This table identifies those opcodes that inspect a character. It is used to @@ -272,7 +272,7 @@ static const uint8_t poptable[] = { 0, 0, 0, /* CLOSE, SKIPZERO, DEFINE */ 1, 1, /* \B and \b in UCP mode */ 1, /* ECLASS */ - 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ + 0, 0, 0, 0, 0 /* ECLASS ops, nested inside ECLASS */ }; /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W, diff --git a/src/pcre2_error.c b/src/pcre2_error.c index 771b40ac9..5e1c3dbef 100644 --- a/src/pcre2_error.c +++ b/src/pcre2_error.c @@ -196,13 +196,18 @@ static const unsigned char compile_error_texts[] = /* 105 */ "PCRE2_EXTRA_TURKISH_CASING requires UTF in 8-bit mode\0" "PCRE2_EXTRA_TURKISH_CASING and PCRE2_EXTRA_CASELESS_RESTRICT are not compatible\0" - "character classes are too deeply nested\0" + "character class nesting is too deep\0" "invalid operator in character class\0" "unexpected operator in character class (no preceding operand)\0" /* 110 */ "expected operand after operator in character class\0" "brackets needed to clarify operator precedence in character class\0" "missing terminating ] for character class (note '[' must be escaped under PCRE2_ALT_EXTENDED_CLASS)\0" + "unexpected expression in character class (no preceding operator)\0" + "empty expression in character class\0" + /* 115 */ + "unexpected ] with no following parenthesis in (?[...\0" + "unexpected character in (?[...]) character class\0" ; /* Match-time and UTF error texts are in the same format. */ diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index 55d950425..379d23c95 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -577,7 +577,10 @@ modes. */ /* The maximum nesting depth for Unicode character class sets. Currently fixed. Warning: the interpreter relies on this so it can encode the operand stack in a uint32_t. A nesting limit of 15 implies (15*2+1)=31 -stack operands required. */ +stack operands required, due to the fact that we have two (and only two) +levels of operator precedence. In the UTS#18 syntax, you can write 'x&&y[z]' +and in Perl syntax you can write '(?[ x - y & (z) ])', both of which imply +pushing the match results for x & y to the stack. */ #define ECLASS_NEST_LIMIT 15 @@ -1749,10 +1752,11 @@ enum { /* These are used for "extended classes" such as [a-z -- aeiou]. */ OP_ECLASS, /* 172 */ - OP_ECLASS_OR, /* 173 */ - OP_ECLASS_AND, /* 174 */ + OP_ECLASS_AND, /* 173 */ + OP_ECLASS_OR, /* 174 */ OP_ECLASS_SUB, /* 175 */ - OP_ECLASS_NOT, /* 176 */ + OP_ECLASS_XOR, /* 176 */ + OP_ECLASS_NOT, /* 177 */ /* This is not an opcode, but is used to check that tables indexed by opcode are the correct length, in order to catch updating errors - there have been @@ -1815,7 +1819,7 @@ some cases doesn't actually use these names at all). */ "*THEN", "*THEN", "*COMMIT", "*COMMIT", "*FAIL", \ "*ACCEPT", "*ASSERT_ACCEPT", \ "Close", "Skip zero", "Define", "\\B (ucp)", "\\b (ucp)", \ - "eclass", "||", "&&", "--", "!!" + "eclass", "&&", "||", "--", "~~", "!!" /* This macro defines the length of fixed length operations in the compiled @@ -1916,7 +1920,7 @@ in UTF-8 mode. The code that uses this table must know about such things. */ 1, /* DEFINE */ \ 1, 1, /* \B and \b in UCP mode */ \ 0, /* ECLASS - variable length */ \ - 1, 1, 1, 1 /* ECLASS ops, nested inside ECLASS */ + 1, 1, 1, 1, 1 /* ECLASS ops, nested inside ECLASS */ /* A magic value for OP_RREF to indicate the "any recursion" condition. */ diff --git a/src/pcre2_printint.c b/src/pcre2_printint.c index a1df310c7..54dd32de7 100644 --- a/src/pcre2_printint.c +++ b/src/pcre2_printint.c @@ -966,10 +966,6 @@ for(;;) fprintf(f, " op: &&\n"); ccode += OP_lengths[*ccode]; break; - case OP_ECLASS_NOT: - fprintf(f, " op: ^\n"); - ccode += OP_lengths[*ccode]; - break; case OP_ECLASS_OR: fprintf(f, " op: ||\n"); ccode += OP_lengths[*ccode]; @@ -978,6 +974,14 @@ for(;;) fprintf(f, " op: --\n"); ccode += OP_lengths[*ccode]; break; + case OP_ECLASS_XOR: + fprintf(f, " op: ~~\n"); + ccode += OP_lengths[*ccode]; + break; + case OP_ECLASS_NOT: + fprintf(f, " op: ^\n"); + ccode += OP_lengths[*ccode]; + break; /* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537 Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */ diff --git a/src/pcre2_xclass.c b/src/pcre2_xclass.c index 1d16bddea..6b2b110b2 100644 --- a/src/pcre2_xclass.c +++ b/src/pcre2_xclass.c @@ -474,14 +474,14 @@ while (ptr < data_end) { switch (*ptr) { - case OP_ECLASS_OR: + case OP_ECLASS_AND: ++ptr; - stack = (stack >> 1) | (stack & (uint32_t)1u); + stack = (stack >> 1) & (stack | ~(uint32_t)1u); break; - case OP_ECLASS_AND: + case OP_ECLASS_OR: ++ptr; - stack = (stack >> 1) & (stack | ~(uint32_t)1u); + stack = (stack >> 1) | (stack & (uint32_t)1u); break; case OP_ECLASS_SUB: @@ -489,6 +489,11 @@ while (ptr < data_end) stack = (stack >> 1) & (~stack | ~(uint32_t)1u); break; + case OP_ECLASS_XOR: + ++ptr; + stack = (stack >> 1) ^ (stack & (uint32_t)1u); + break; + case OP_ECLASS_NOT: ++ptr; stack ^= (uint32_t)1u; diff --git a/testdata/testinput1 b/testdata/testinput1 index 1e50369fc..10ee898cd 100644 --- a/testdata/testinput1 +++ b/testdata/testinput1 @@ -6729,4 +6729,311 @@ $/x z \ \ +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\n])/ + \n +\= Expect no match + \\ + n + +/^(?[\x61])b/ + ab +\= Expect no match + b + a + +/^(?[\x61])+b/ + ab + aab +\= Expect no match + b + +/(?[ [[:graph:]] ])/ + a +\= Expect no match + \x01 + +/(?[ [:graph:] ])/ + a +\= Expect no match + \x01 + +/(?[ [[:graph:]\x02] ])/ + a + \x02 +\= Expect no match + \x01 + +/(?[\E\n])/ + \n +\= Expect no match + \\ + E + +/(?[\n \Q\E])/ + \n +\= Expect no match + \\ + Q + +/(?[ ( \x02 + [:graph:] ) | [ \x02 [:graph:] ] ])/ + a + \x02 +\= Expect no match + \x01 + +/(?[ \d ])/ + 1 +\= Expect no match + d + +/(?[[1]])/ + 1 +\= Expect no match + ] + +/(?[[a]])/ + a +\= Expect no match + ] + +/(?[[a-c]])/ + a + b +\= Expect no match + - + ] + +/(?[ [\t] + [\n] ])/ + \t + \n +\= Expect no match + t + \\ + [ + +/(?[ \t + \n ])/ + \t + \n +\= Expect no match + t + \\ + [ + +/(?[ [()] ])/ + ) + ( +\= Expect no match + ] + +/(?[ ( [()] ) ])/ + ) + ( +\= Expect no match + ] + +/(?[ (( [\n\t] )) ])/ + \n + \t +\= Expect no match + ) + ( + t + +# Each syntax element, with unary operator applied to it + +/(?[ !\n ])/ + z +\= Expect no match + \n + +/(?[ !\d ])/ + a +\= Expect no match + 1 + +/(?[ ![:alpha:] ])/ + 1 +\= Expect no match + a + +/(?[ ![\n] ])/ + z +\= Expect no match + \n + +/(?[ !(\n) ])/ + z +\= Expect no match + \n + +/(?[ !!\n ])/ + \n +\= Expect no match + z + +# Each syntax element, as contents of parens + +/(?[ (\n) ])/ + \n +\= Expect no match + z + +/(?[ (\d) ])/ + 1 +\= Expect no match + a + +/(?[ ([:alpha:]) ])/ + a +\= Expect no match + 1 + +/(?[ ([\n]) ])/ + \n +\= Expect no match + z + +/(?[ ((\n)) ])/ + \n +\= Expect no match + z + +/(?[ (!\n) ])/ + z +\= Expect no match + \n + +/(?[ (\n + \t) ])/ + \n + \t +\= Expect no match + z + +# Each syntax element, as LHS of a binary operator + +/(?[ \n & [\n\t] ])/ + \n +\= Expect no match + t + +/(?[ \d & [\d\t] ])/ + 1 +\= Expect no match + a + +/(?[ [:alpha:] & [a-z\t] ])/ + a +\= Expect no match + A + \t + +/(?[ [\n] & [\n\t] ])/ + \n +\= Expect no match + \t + +/(?[ (\n) & [\n\t] ])/ + \n +\= Expect no match + \t + +/(?[ !\n & [^\n\t] ])/ + a +\= Expect no match + \n + \t + +/(?[ \n & [\n\t] + [\d] ])/ + \n + 1 +\= Expect no match + \t + a + +# Each syntax element, as RHS of a binary operator + +/(?[ [\n\t] & \n ])/ + \n +\= Expect no match + t + +/(?[ [\d\t] & \d ])/ + 1 +\= Expect no match + a + +/(?[ [a-z\t] & [:alpha:] ])/ + a +\= Expect no match + A + \t + +/(?[ [\n\t] & [\n] ])/ + \n +\= Expect no match + \t + +/(?[ [\n\t] & (\n) ])/ + \n +\= Expect no match + \t + +/(?[ [^\n\t] & !\n ])/ + a +\= Expect no match + \n + \t + +/(?[ [\d] + \n & [\n\t] ])/ + \n + 1 +\= Expect no match + \t + a + +/(?[ [\d] + \n + [\t] ])/ + \n + \t + 1 +\= Expect no match + a + +# end op surrounding syntax tests + +/(?[ \d + \n ])/ + \n + 1 +\= Expect no match + a + +/(?[ \d | \n ])/ + \n + 1 +\= Expect no match + a + +/(?[ \d - [2] ])/ + 1 + 3 +\= Expect no match + 2 + +/(?[ [AC] ^ [BC] ])/ + A + B +\= Expect no match + C + D + +/(?[ ( [ ^ z ] ) ])/ + j +\= Expect no match + z + +# -------------- + # End of testinput1 diff --git a/testdata/testinput11 b/testdata/testinput11 index 26d976e9e..c7d237a2f 100644 --- a/testdata/testinput11 +++ b/testdata/testinput11 @@ -476,4 +476,29 @@ # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B + \x{80000002} + \x{8fffffff} +\= Expect no match + \x{7fffffff} + \x{90000000} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +# -------------- + # End of testinput11 diff --git a/testdata/testinput2 b/testdata/testinput2 index 1fbb778e0..e4666bf9d 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -3975,6 +3975,12 @@ no reduction Alfred Winifred +/[[:<:]]+red/B + little red riding hood + red is a colour +\= Expect no match + Alfred + /[a[:<:]] should give error/ /(?=ab\K)/aftertext,allow_lookaround_bsk @@ -6707,6 +6713,34 @@ a)"xI B C +/[[AC]||[BC]]/B,alt_extended_class + A + B + C +\= Expect no match + D + +/[[AC]&&[BC]]/B,alt_extended_class + C +\= Expect no match + A + B + D + +/[[AC]--[BC]]/B,alt_extended_class + A +\= Expect no match + B + C + D + +/[[AC]~~[BC]]/B,alt_extended_class + A + B +\= Expect no match + C + D + /[A[]]]/B,alt_extended_class A ] @@ -6970,8 +7004,24 @@ a)"xI /[a|||b]/alt_extended_class +/[a~~~b]/alt_extended_class + +/[a~~~~b]/alt_extended_class + +/[a~~/alt_extended_class + +/[a~~~/alt_extended_class + +/[a~~~~/alt_extended_class + /[a||b&&c]/alt_extended_class +/[a||b~~c]/alt_extended_class + +/[a~~b&&c]/alt_extended_class + +/[a--b~~c]/alt_extended_class + /[a--b&&c]/alt_extended_class /[a||b--c]/alt_extended_class @@ -6998,6 +7048,120 @@ a)"xI /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a&&a[z]]]]]]]]]]]]]]]]/alt_extended_class +/[z&/alt_extended_class + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +# allow-empty-class does nothing inside (?[...]) +/(?[ []] ])/B,allow_empty_class + ] + +# bad-escape-is-literal does nothing inside (?[...]) +/[ \j ]/ + +/[ /\ + +/(?[ \j ])/ + +/(?[ /\ + +/[ \j ]/bad_escape_is_literal + j +\= Expect no match + k + +/[ /\bad_escape_is_literal + +/(?[ \j ])/bad_escape_is_literal + +/(?[ /\bad_escape_is_literal + +/(?[ [\j] ])/bad_escape_is_literal + +/(?[ (\j) ])/bad_escape_is_literal + +# We can't test error cases in testinput1 + +/(?[])/ + +/(?[/ + +/(?[]/ + +/(?[\n/ + +/(?[\n]/ + +/(?[\n]z)/ + +/(?[\n] )/ + +/(?[(/ + +/(?[( / + +/(?[(\n/ + +/(?[ \n + () ])/ + +/(?[1])/ + +/(?[a])/ + +/(?[a-c])/ + +/(?[(])/ + +/(?[(\n])/ + +/(?[\n)])/ + +/(?[^\n])/ + +/(?[ \n \t ])/ + +/(?[ \d \t ])/ + +/(?[ [\n] \t ])/ + +/(?[ (\n) \t ])/ + +/(?[ [:alpha:] \t ])/ + +/(?[ \n + \t \d ])/ + +/(?[ !\n \t ])/ + +/(?[ \n [:alpha:] ])/ + +/(?[ \n [\d] ])/ + +/(?[ \n (\t) ])/ + +/(?[ \n !\t ])/ + +/(?[ \n \t ])/ + +/(?[:graph:])/ + +/(?[\Qn\E])/ + +# maximum depth tests + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&\n))))))))))))))])/ + \n +\= Expect no match + a + b + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+([\n]&\n))))))))))))))])/ + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&[\n]))))))))))))))])/ + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+((\n)&\n))))))))))))))])/ + # -------------- /[[:digit:] -Z]/xx @@ -7016,4 +7180,41 @@ a)"xI /[\d-z]/ +/[\d-\w]/ + +/[\Q/ + +/[\Q/\ + +/[\Q\E/ + +/[\Q\n/ + +/[\Q\n]/ + +/[\Q\n/\ + +/[\Q\n\]/ + +/[\Q\n\E/ + +/[\Q\n\E]/ + \\ + n +\= Expect no match + \n + Q + +/[z\Q/ + +/[z\Q/\ + +/[z\Q\E/ + +/[/\ + +/[\n/ + +/[\E/ + # End of testinput2 diff --git a/testdata/testinput4 b/testdata/testinput4 index c0c17e103..8b3e0c7b8 100644 --- a/testdata/testinput4 +++ b/testdata/testinput4 @@ -2949,4 +2949,96 @@ A \x{3a3} +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\p{L} - \p{Lu}])/ + a +\= Expect no match + A + 1 + +/(?[\p{L} & \p{Lu}])/ + A +\= Expect no match + a + 1 + +/(?[[\p{Lu}z] ^ [\p{Ll}G]])/ + A + p +\= Expect no match + G + z + 1 + +/(?[\p{Ll} | \p{Nd}])/ + a + 1 +\= Expect no match + A + +/(?[\p{Ll} + [\p{Nd}]])/ + a + 1 +\= Expect no match + A + +/(?[ ![\p{Nd}z] ])/ + _ + Z +\= Expect no match + 1 + z + +/(?[ \P{Nd} + [2] ])/ + _ + Z + 2 +\= Expect no match + 1 + 3 + +/(?[ ![\P{Nd}] ])/ + 1 + 2 +\= Expect no match + _ + z + +# caseless tests + +/(?[ \p{Lu} ^ \p{Ll} ])/ + a + A +\= Expect no match + _ + 1 + +/(?[ [\p{Lu}1] ^ \p{Ll} ])/i + 1 +\= Expect no match + a + A + _ + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/ + 1 +\= Expect no match + a + A + _ + 2 + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/i + a + A + 1 +\= Expect no match + _ + 2 + +# -------------- + # End of testinput4 diff --git a/testdata/testinput5 b/testdata/testinput5 index 503ffa654..05e7fe613 100644 --- a/testdata/testinput5 +++ b/testdata/testinput5 @@ -2927,6 +2927,34 @@ A 0 +/[[\p{Lu}\p{Ll}]||[\p{Nd}\p{Ll}]]/B,alt_extended_class + A + 1 + c +\= Expect no match + _ + +/[[\p{Lu}\p{Ll}]&&[\p{Nd}\p{Ll}]]/B,alt_extended_class + c +\= Expect no match + A + 1 + _ + +/[[\p{Lu}\p{Ll}]--[\p{Nd}\p{Ll}]]/B,alt_extended_class + A +\= Expect no match + 1 + c + _ + +/[[\p{Lu}\p{Ll}]~~[\p{Nd}\p{Ll}]]/B,alt_extended_class + A + 1 +\= Expect no match + c + _ + /[\pL[]]]/B,alt_extended_class A ] @@ -3178,12 +3206,59 @@ \= Expect no match n_n -/[\p{Nd}||[\pL--\p{Lu}]]/alt_extended_class +/[\p{Nd}||[\pL--\p{Lu}]]/B,alt_extended_class a 0 \= Expect no match C +/[\P{Nd}||2]/B,alt_extended_class + _ + Z + 2 +\= Expect no match + 1 + 3 + +/[^[\P{Nd}]]/B,alt_extended_class + 1 + 2 +\= Expect no match + _ + z + +# caseless tests + +/[\p{Lu}~~\p{Ll}]/B,alt_extended_class + a + A +\= Expect no match + _ + 1 + +/[[\p{Lu}1]~~\p{Ll}]/iB,alt_extended_class + 1 +\= Expect no match + a + A + _ + +/[[\p{Lu}1]&&[\p{Ll}1]]/B,alt_extended_class + 1 +\= Expect no match + a + A + _ + 2 + +/[[\p{Lu}1]&&[\p{Ll}1]]/iB,alt_extended_class + a + A + 1 +\= Expect no match + _ + 2 + /[[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] diff --git a/testdata/testinput6 b/testdata/testinput6 index ac3315b7c..1fbe4ce8d 100644 --- a/testdata/testinput6 +++ b/testdata/testinput6 @@ -5189,4 +5189,15 @@ # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[A]+[B]])/ + A + B +\= Expect no match + [ + ] + +# -------------- + # End of testinput6 diff --git a/testdata/testinput7 b/testdata/testinput7 index d91ea2854..56cd49e3f 100644 --- a/testdata/testinput7 +++ b/testdata/testinput7 @@ -2682,6 +2682,64 @@ >aBBd< >Abb\x{01c5}< +# -------------- + +# EXTENDED CHARACTER CLASSES + +/[\p{Ll}[\p{Nd}]]C/alt_extended_class + aC + 1C +\= Expect no match + [C + +/[[\p{Ll}][\p{Nd}]]/alt_extended_class + a + 1 +\= Expect no match + [ + ] + +/[[\p{Ll}]||[\p{Nd}]]/alt_extended_class + a + 1 +\= Expect no match + C + +/[[^\p{Ll}][\p{Nd}]]/alt_extended_class + 1 + A +\= Expect no match + a + +/[^[\p{Ll}][\p{Nd}]]/alt_extended_class + A +\= Expect no match + a + 1 + +/[^[\p{Ll}]&&[\p{Nd}]]/alt_extended_class + a + 1 + A + +/(?[[\p{Ll}]+[\p{Nd}]])/ + a + 1 +\= Expect no match + [ + ] + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[\p{Ll}Z]&[\p{Lu}a]])/ + a + Z +\= Expect no match + A + z + # -------------------------------------------------------------------------- # End of testinput7 diff --git a/testdata/testoutput1 b/testdata/testoutput1 index 6f9277292..1e20d9d2c 100644 --- a/testdata/testoutput1 +++ b/testdata/testoutput1 @@ -10611,4 +10611,449 @@ No match \ \ No match +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\n])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + n +No match + +/^(?[\x61])b/ + ab + 0: ab +\= Expect no match + b +No match + a +No match + +/^(?[\x61])+b/ + ab + 0: ab + aab + 0: aab +\= Expect no match + b +No match + +/(?[ [[:graph:]] ])/ + a + 0: a +\= Expect no match + \x01 +No match + +/(?[ [:graph:] ])/ + a + 0: a +\= Expect no match + \x01 +No match + +/(?[ [[:graph:]\x02] ])/ + a + 0: a + \x02 + 0: \x02 +\= Expect no match + \x01 +No match + +/(?[\E\n])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + E +No match + +/(?[\n \Q\E])/ + \n + 0: \x0a +\= Expect no match + \\ +No match + Q +No match + +/(?[ ( \x02 + [:graph:] ) | [ \x02 [:graph:] ] ])/ + a + 0: a + \x02 + 0: \x02 +\= Expect no match + \x01 +No match + +/(?[ \d ])/ + 1 + 0: 1 +\= Expect no match + d +No match + +/(?[[1]])/ + 1 + 0: 1 +\= Expect no match + ] +No match + +/(?[[a]])/ + a + 0: a +\= Expect no match + ] +No match + +/(?[[a-c]])/ + a + 0: a + b + 0: b +\= Expect no match + - +No match + ] +No match + +/(?[ [\t] + [\n] ])/ + \t + 0: \x09 + \n + 0: \x0a +\= Expect no match + t +No match + \\ +No match + [ +No match + +/(?[ \t + \n ])/ + \t + 0: \x09 + \n + 0: \x0a +\= Expect no match + t +No match + \\ +No match + [ +No match + +/(?[ [()] ])/ + ) + 0: ) + ( + 0: ( +\= Expect no match + ] +No match + +/(?[ ( [()] ) ])/ + ) + 0: ) + ( + 0: ( +\= Expect no match + ] +No match + +/(?[ (( [\n\t] )) ])/ + \n + 0: \x0a + \t + 0: \x09 +\= Expect no match + ) +No match + ( +No match + t +No match + +# Each syntax element, with unary operator applied to it + +/(?[ !\n ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !\d ])/ + a + 0: a +\= Expect no match + 1 +No match + +/(?[ ![:alpha:] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ ![\n] ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !(\n) ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ !!\n ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +# Each syntax element, as contents of parens + +/(?[ (\n) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ (\d) ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ ([:alpha:]) ])/ + a + 0: a +\= Expect no match + 1 +No match + +/(?[ ([\n]) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ ((\n)) ])/ + \n + 0: \x0a +\= Expect no match + z +No match + +/(?[ (!\n) ])/ + z + 0: z +\= Expect no match + \n +No match + +/(?[ (\n + \t) ])/ + \n + 0: \x0a + \t + 0: \x09 +\= Expect no match + z +No match + +# Each syntax element, as LHS of a binary operator + +/(?[ \n & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + t +No match + +/(?[ \d & [\d\t] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ [:alpha:] & [a-z\t] ])/ + a + 0: a +\= Expect no match + A +No match + \t +No match + +/(?[ [\n] & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ (\n) & [\n\t] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ !\n & [^\n\t] ])/ + a + 0: a +\= Expect no match + \n +No match + \t +No match + +/(?[ \n & [\n\t] + [\d] ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + \t +No match + a +No match + +# Each syntax element, as RHS of a binary operator + +/(?[ [\n\t] & \n ])/ + \n + 0: \x0a +\= Expect no match + t +No match + +/(?[ [\d\t] & \d ])/ + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ [a-z\t] & [:alpha:] ])/ + a + 0: a +\= Expect no match + A +No match + \t +No match + +/(?[ [\n\t] & [\n] ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ [\n\t] & (\n) ])/ + \n + 0: \x0a +\= Expect no match + \t +No match + +/(?[ [^\n\t] & !\n ])/ + a + 0: a +\= Expect no match + \n +No match + \t +No match + +/(?[ [\d] + \n & [\n\t] ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + \t +No match + a +No match + +/(?[ [\d] + \n + [\t] ])/ + \n + 0: \x0a + \t + 0: \x09 + 1 + 0: 1 +\= Expect no match + a +No match + +# end op surrounding syntax tests + +/(?[ \d + \n ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ \d | \n ])/ + \n + 0: \x0a + 1 + 0: 1 +\= Expect no match + a +No match + +/(?[ \d - [2] ])/ + 1 + 0: 1 + 3 + 0: 3 +\= Expect no match + 2 +No match + +/(?[ [AC] ^ [BC] ])/ + A + 0: A + B + 0: B +\= Expect no match + C +No match + D +No match + +/(?[ ( [ ^ z ] ) ])/ + j + 0: j +\= Expect no match + z +No match + +# -------------- + # End of testinput1 diff --git a/testdata/testoutput11-16 b/testdata/testoutput11-16 index 7101d20c9..6cc49d33f 100644 --- a/testdata/testoutput11-16 +++ b/testdata/testoutput11-16 @@ -822,4 +822,32 @@ Failed: error 134 at offset 13: character code point value in \x{} or \o{} is to # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000002} + \x{8fffffff} +\= Expect no match + \x{7fffffff} + \x{90000000} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +Failed: error 134 at offset 15: character code point value in \x{} or \o{} is too large + \x{80000001} + \x{80000003} +\= Expect no match + \x{80000002} + +# -------------- + # End of testinput11 diff --git a/testdata/testoutput11-32 b/testdata/testoutput11-32 index 3d0a75332..8fc94506c 100644 --- a/testdata/testoutput11-32 +++ b/testdata/testoutput11-32 @@ -939,4 +939,69 @@ No match # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +# META_BIGVALUE tests + +/(?[[\x{80000000}-\x{8000000f}]+\x{8fffffff}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{8fffffff}] + op: || + ] + Ket + End +------------------------------------------------------------------ + \x{80000002} + 0: \x{80000002} + \x{8fffffff} + 0: \x{8fffffff} +\= Expect no match + \x{7fffffff} +No match + \x{90000000} +No match + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{80000002}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + \x{80000001} + 0: \x{80000001} + \x{80000003} + 0: \x{80000003} +\= Expect no match + \x{80000002} +No match + +/(?[[\x{80000000}-\x{8000000f}]-\x{80000002}])/B +------------------------------------------------------------------ + Bra + eclass[ + cls:[\x{80000000}-\x{8000000f}] + cls:[\x{80000002}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + \x{80000001} + 0: \x{80000001} + \x{80000003} + 0: \x{80000003} +\= Expect no match + \x{80000002} +No match + +# -------------- + # End of testinput11 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index 99714596e..3572e3258 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -13476,6 +13476,29 @@ No match Alfred Winifred No match +/[[:<:]]+red/B +------------------------------------------------------------------ + Bra + \b + Assert + \w + Ket + Brazero + Assert + \w + Ket + red + Ket + End +------------------------------------------------------------------ + little red riding hood + 0: red + red is a colour + 0: red +\= Expect no match + Alfred +No match + /[a[:<:]] should give error/ Failed: error 130 at offset 7: unknown POSIX class name @@ -19744,6 +19767,90 @@ No match C 0: C +/[[AC]||[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: || + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + B + 0: B + C + 0: C +\= Expect no match + D +No match + +/[[AC]&&[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: && + ] + Ket + End +------------------------------------------------------------------ + C + 0: C +\= Expect no match + A +No match + B +No match + D +No match + +/[[AC]--[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: -- + ] + Ket + End +------------------------------------------------------------------ + A + 0: A +\= Expect no match + B +No match + C +No match + D +No match + +/[[AC]~~[BC]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[AC] + cls:[BC] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + B + 0: B +\= Expect no match + C +No match + D +No match + /[A[]]]/B,alt_extended_class ------------------------------------------------------------------ Bra @@ -20600,9 +20707,33 @@ Failed: error 208 at offset 5: invalid operator in character class /[a|||b]/alt_extended_class Failed: error 208 at offset 5: invalid operator in character class +/[a~~~b]/alt_extended_class +Failed: error 208 at offset 5: invalid operator in character class + +/[a~~~~b]/alt_extended_class +Failed: error 208 at offset 6: invalid operator in character class + +/[a~~/alt_extended_class +Failed: error 106 at offset 4: missing terminating ] for character class + +/[a~~~/alt_extended_class +Failed: error 208 at offset 5: invalid operator in character class + +/[a~~~~/alt_extended_class +Failed: error 208 at offset 6: invalid operator in character class + /[a||b&&c]/alt_extended_class Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class +/[a||b~~c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + +/[a~~b&&c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + +/[a--b~~c]/alt_extended_class +Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class + /[a--b&&c]/alt_extended_class Failed: error 211 at offset 7: brackets needed to clarify operator precedence in character class @@ -20646,10 +20777,183 @@ Failed: error 150 at offset 5: invalid range in character class No match /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[z]&&a]]]]]]]]]]]]]]]/alt_extended_class -Failed: error 207 at offset 115: character classes are too deeply nested +Failed: error 207 at offset 115: character class nesting is too deep /[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a[b]&&a[a&&a[z]]]]]]]]]]]]]]]]/alt_extended_class -Failed: error 207 at offset 118: character classes are too deeply nested +Failed: error 207 at offset 118: character class nesting is too deep + +/[z&/alt_extended_class +Failed: error 106 at offset 3: missing terminating ] for character class + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +# allow-empty-class does nothing inside (?[...]) +/(?[ []] ])/B,allow_empty_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\]] + ] + Ket + End +------------------------------------------------------------------ + ] + 0: ] + +# bad-escape-is-literal does nothing inside (?[...]) +/[ \j ]/ +Failed: error 103 at offset 3: unrecognized character follows \ + +/[ /\ +Failed: error 101 at offset 3: \ at end of pattern + +/(?[ \j ])/ +Failed: error 103 at offset 5: unrecognized character follows \ + +/(?[ /\ +Failed: error 101 at offset 5: \ at end of pattern + +/[ \j ]/bad_escape_is_literal + j + 0: j +\= Expect no match + k +No match + +/[ /\bad_escape_is_literal +Failed: error 106 at offset 3: missing terminating ] for character class + +/(?[ \j ])/bad_escape_is_literal +Failed: error 103 at offset 5: unrecognized character follows \ + +/(?[ /\bad_escape_is_literal +Failed: error 101 at offset 5: \ at end of pattern + +/(?[ [\j] ])/bad_escape_is_literal +Failed: error 103 at offset 6: unrecognized character follows \ + +/(?[ (\j) ])/bad_escape_is_literal +Failed: error 103 at offset 6: unrecognized character follows \ + +# We can't test error cases in testinput1 + +/(?[])/ +Failed: error 214 at offset 4: empty expression in character class + +/(?[/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/(?[]/ +Failed: error 214 at offset 4: empty expression in character class + +/(?[\n/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/(?[\n]/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[\n]z)/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[\n] )/ +Failed: error 215 at offset 6: unexpected ] with no following parenthesis in (?[... + +/(?[(/ +Failed: error 114 at offset 4: missing closing parenthesis + +/(?[( / +Failed: error 106 at offset 5: missing terminating ] for character class + +/(?[(\n/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/(?[ \n + () ])/ +Failed: error 214 at offset 11: empty expression in character class + +/(?[1])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[a])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[a-c])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[(])/ +Failed: error 114 at offset 4: missing closing parenthesis + +/(?[(\n])/ +Failed: error 114 at offset 6: missing closing parenthesis + +/(?[\n)])/ +Failed: error 122 at offset 6: unmatched closing parenthesis + +/(?[^\n])/ +Failed: error 209 at offset 4: unexpected operator in character class (no preceding operand) + +/(?[ \n \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[ \d \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[ [\n] \t ])/ +Failed: error 213 at offset 11: unexpected expression in character class (no preceding operator) + +/(?[ (\n) \t ])/ +Failed: error 213 at offset 11: unexpected expression in character class (no preceding operator) + +/(?[ [:alpha:] \t ])/ +Failed: error 213 at offset 16: unexpected expression in character class (no preceding operator) + +/(?[ \n + \t \d ])/ +Failed: error 213 at offset 14: unexpected expression in character class (no preceding operator) + +/(?[ !\n \t ])/ +Failed: error 213 at offset 10: unexpected expression in character class (no preceding operator) + +/(?[ \n [:alpha:] ])/ +Failed: error 213 at offset 16: unexpected expression in character class (no preceding operator) + +/(?[ \n [\d] ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n (\t) ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n !\t ])/ +Failed: error 213 at offset 8: unexpected expression in character class (no preceding operator) + +/(?[ \n \t ])/ +Failed: error 213 at offset 9: unexpected expression in character class (no preceding operator) + +/(?[:graph:])/ +Failed: error 216 at offset 4: unexpected character in (?[...]) character class + +/(?[\Qn\E])/ +Failed: error 216 at offset 6: unexpected character in (?[...]) character class + +# maximum depth tests + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&\n))))))))))))))])/ + \n + 0: \x0a +\= Expect no match + a +No match + b +No match + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+([\n]&\n))))))))))))))])/ +Failed: error 207 at offset 158: character class nesting is too deep + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n&[\n]))))))))))))))])/ +Failed: error 207 at offset 161: character class nesting is too deep + +/(?[\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+(\n+[a]&\n+((\n)&\n))))))))))))))])/ +Failed: error 207 at offset 158: character class nesting is too deep # -------------- @@ -20677,6 +20981,62 @@ Failed: error 150 at offset 11: invalid range in character class /[\d-z]/ Failed: error 150 at offset 4: invalid range in character class +/[\d-\w]/ +Failed: error 150 at offset 4: invalid range in character class + +/[\Q/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/[\Q/\ +Failed: error 106 at offset 4: missing terminating ] for character class + +/[\Q\E/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[\Q\n/ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[\Q\n]/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[\Q\n/\ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[\Q\n\]/ +Failed: error 106 at offset 7: missing terminating ] for character class + +/[\Q\n\E/ +Failed: error 106 at offset 7: missing terminating ] for character class + +/[\Q\n\E]/ + \\ + 0: \ + n + 0: n +\= Expect no match + \n +No match + Q +No match + +/[z\Q/ +Failed: error 106 at offset 4: missing terminating ] for character class + +/[z\Q/\ +Failed: error 106 at offset 5: missing terminating ] for character class + +/[z\Q\E/ +Failed: error 106 at offset 6: missing terminating ] for character class + +/[/\ +Failed: error 101 at offset 2: \ at end of pattern + +/[\n/ +Failed: error 106 at offset 3: missing terminating ] for character class + +/[\E/ +Failed: error 106 at offset 3: missing terminating ] for character class + # End of testinput2 Error -70: PCRE2_ERROR_BADDATA (unknown error number) Error -62: bad serialized data diff --git a/testdata/testoutput4 b/testdata/testoutput4 index a23250314..fcde19f12 100644 --- a/testdata/testoutput4 +++ b/testdata/testoutput4 @@ -4722,4 +4722,144 @@ No match \x{3a3} No match +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[\p{L} - \p{Lu}])/ + a + 0: a +\= Expect no match + A +No match + 1 +No match + +/(?[\p{L} & \p{Lu}])/ + A + 0: A +\= Expect no match + a +No match + 1 +No match + +/(?[[\p{Lu}z] ^ [\p{Ll}G]])/ + A + 0: A + p + 0: p +\= Expect no match + G +No match + z +No match + 1 +No match + +/(?[\p{Ll} | \p{Nd}])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + A +No match + +/(?[\p{Ll} + [\p{Nd}]])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + A +No match + +/(?[ ![\p{Nd}z] ])/ + _ + 0: _ + Z + 0: Z +\= Expect no match + 1 +No match + z +No match + +/(?[ \P{Nd} + [2] ])/ + _ + 0: _ + Z + 0: Z + 2 + 0: 2 +\= Expect no match + 1 +No match + 3 +No match + +/(?[ ![\P{Nd}] ])/ + 1 + 0: 1 + 2 + 0: 2 +\= Expect no match + _ +No match + z +No match + +# caseless tests + +/(?[ \p{Lu} ^ \p{Ll} ])/ + a + 0: a + A + 0: A +\= Expect no match + _ +No match + 1 +No match + +/(?[ [\p{Lu}1] ^ \p{Ll} ])/i + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + 2 +No match + +/(?[ [\p{Lu}1] & [\p{Ll}1] ])/i + a + 0: a + A + 0: A + 1 + 0: 1 +\= Expect no match + _ +No match + 2 +No match + +# -------------- + # End of testinput4 diff --git a/testdata/testoutput5 b/testdata/testoutput5 index 00743f193..29dbd1678 100644 --- a/testdata/testoutput5 +++ b/testdata/testoutput5 @@ -6351,6 +6351,90 @@ No match 0 0: 0 +/[[\p{Lu}\p{Ll}]||[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: || + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + 1 + 0: 1 + c + 0: c +\= Expect no match + _ +No match + +/[[\p{Lu}\p{Ll}]&&[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: && + ] + Ket + End +------------------------------------------------------------------ + c + 0: c +\= Expect no match + A +No match + 1 +No match + _ +No match + +/[[\p{Lu}\p{Ll}]--[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: -- + ] + Ket + End +------------------------------------------------------------------ + A + 0: A +\= Expect no match + 1 +No match + c +No match + _ +No match + +/[[\p{Lu}\p{Ll}]~~[\p{Nd}\p{Ll}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}\p{Ll}] + cls:[\p{Nd}\p{Ll}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + A + 0: A + 1 + 0: 1 +\= Expect no match + c +No match + _ +No match + /[\pL[]]]/B,alt_extended_class ------------------------------------------------------------------ Bra @@ -7158,7 +7242,19 @@ No match n_n No match -/[\p{Nd}||[\pL--\p{Lu}]]/alt_extended_class +/[\p{Nd}||[\pL--\p{Lu}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Nd}] + cls:[\p{L}] + cls:[\p{Lu}] + op: -- + op: || + ] + Ket + End +------------------------------------------------------------------ a 0: a 0 @@ -7167,6 +7263,139 @@ No match C No match +/[\P{Nd}||2]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\P{Nd}] + cls:[2] + op: || + ] + Ket + End +------------------------------------------------------------------ + _ + 0: _ + Z + 0: Z + 2 + 0: 2 +\= Expect no match + 1 +No match + 3 +No match + +/[^[\P{Nd}]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\P{Nd}] + op: ^ + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 + 2 + 0: 2 +\= Expect no match + _ +No match + z +No match + +# caseless tests + +/[\p{Lu}~~\p{Ll}]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[\p{Lu}] + cls:[\p{Ll}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + a + 0: a + A + 0: A +\= Expect no match + _ +No match + 1 +No match + +/[[\p{Lu}1]~~\p{Ll}]/iB,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + cls:[\p{Lc}] + op: ~~ + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + +/[[\p{Lu}1]&&[\p{Ll}1]]/B,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Z\xc0-\xd6\xd8-\xde\p{Lu}] + cls:[1a-z\xb5\xdf-\xf6\xf8-\xff\p{Ll}] + op: && + ] + Ket + End +------------------------------------------------------------------ + 1 + 0: 1 +\= Expect no match + a +No match + A +No match + _ +No match + 2 +No match + +/[[\p{Lu}1]&&[\p{Ll}1]]/iB,alt_extended_class +------------------------------------------------------------------ + Bra + eclass[ + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + cls:[1A-Za-z\xb5\xc0-\xd6\xd8-\xf6\xf8-\xff\p{Lc}] + op: && + ] + Ket + End +------------------------------------------------------------------ + a + 0: a + A + 0: A + 1 + 0: 1 +\= Expect no match + _ +No match + 2 +No match + /[[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] &&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]]&&[\pN--[\pC||\x{9F5}]] diff --git a/testdata/testoutput6 b/testdata/testoutput6 index 757e49db6..74c40e904 100644 --- a/testdata/testoutput6 +++ b/testdata/testoutput6 @@ -8149,4 +8149,19 @@ No match # -------------- +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[A]+[B]])/ + A + 0: A + B + 0: B +\= Expect no match + [ +No match + ] +No match + +# -------------- + # End of testinput6 diff --git a/testdata/testoutput7 b/testdata/testoutput7 index eb27100d2..caf9eb25f 100644 --- a/testdata/testoutput7 +++ b/testdata/testoutput7 @@ -4462,6 +4462,91 @@ No match >Abb\x{01c5}< No match +# -------------- + +# EXTENDED CHARACTER CLASSES + +/[\p{Ll}[\p{Nd}]]C/alt_extended_class + aC + 0: aC + 1C + 0: 1C +\= Expect no match + [C +No match + +/[[\p{Ll}][\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 +\= Expect no match + [ +No match + ] +No match + +/[[\p{Ll}]||[\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 +\= Expect no match + C +No match + +/[[^\p{Ll}][\p{Nd}]]/alt_extended_class + 1 + 0: 1 + A + 0: A +\= Expect no match + a +No match + +/[^[\p{Ll}][\p{Nd}]]/alt_extended_class + A + 0: A +\= Expect no match + a +No match + 1 +No match + +/[^[\p{Ll}]&&[\p{Nd}]]/alt_extended_class + a + 0: a + 1 + 0: 1 + A + 0: A + +/(?[[\p{Ll}]+[\p{Nd}]])/ + a + 0: a + 1 + 0: 1 +\= Expect no match + [ +No match + ] +No match + +# -------------- + +# EXTENDED CHARACTER CLASSES (Perl) + +/(?[[\p{Ll}Z]&[\p{Lu}a]])/ + a + 0: a + Z + 0: Z +\= Expect no match + A +No match + z +No match + # -------------------------------------------------------------------------- # End of testinput7