Implement Perl extended character classes
NWilson committed Nov 8, 2024
1 parent 6f36e8a commit 5385503
Showing 13 changed files with 644 additions and 209 deletions.
8 changes: 5 additions & 3 deletions HACKING
@@ -199,9 +199,11 @@ META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument (see below for with argument)
META_THEN (*THEN) - no argument (see below for with argument)
-META_ECLASS_OR || in an extended character class
-META_ECLASS_AND && in an extended character class
-META_ECLASS_SUB -- in an extended character class
+META_ECLASS_AND && (or &) in an extended character class
+META_ECLASS_OR || (or |, +) in an extended character class
+META_ECLASS_SUB -- (or -) in an extended character class
+META_ECLASS_XOR ~~ (or ^) in an extended character class
+META_ECLASS_NOT ! in an extended character class

The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
154 changes: 96 additions & 58 deletions doc/html/pcre2pattern.html

Large diffs are not rendered by default.

50 changes: 42 additions & 8 deletions doc/pcre2pattern.3
@@ -1547,6 +1547,39 @@ the next two sections), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
.
.
+.SH "PERL EXTENDED CHARACTER CLASSES"
+.rs
+PCRE2 supports Perl's "(?[...])" extended character class syntax. This can
+be used to perform set operations, such as intersection.
+.P
+The syntax permitted within "(?[...])" is quite different to ordinary character
+classes. Inside the extended class, there is an expression syntax consisting of
+"atoms", operators, and ordinary parentheses "()" used for grouping. The allowed
+atoms are any escaped characters or sets such as "\en" or "\ed", POSIX classes
+such as "[:alpha:]", and any ordinary character class may be nested as an atom
+within an extended class. For example, in "(?[\ed & [...]])" the nested ordinary
+class "[...]" follows the ordinary rules for character classes, in which
+parentheses are not metacharacters, and character literals and ranges are
+permitted. However, when outside an ordinary character class (such as in "(?[...
++ (...)])") character literals and ranges may not be used, as they are not atoms
+in the extended syntax. The extended syntax does not introduce any additional
+escape sequences, so "(?[\ey])" is an unknown escape, as it would be inside
+"[\ey]".
+.P
+In the extended syntax, ^ does not negate a class (except within an
+ordinary class nested inside an extended class); it is instead a binary
+operator.
+.P
+The binary operators are "&" (intersection), "|" or "+" (union), "-"
+(subtraction) and "^" (symmetric difference). These are left-associative and
+"&" has higher (tighter) precedence, while the others have equal lower
+precedence. The one prefix unary operator is "!" (complement), with highest
+precedence.
+.P
+A Perl extended character class always has the /xx modifier turned on within
+it.
+.
+.
.SH "UTS#18 EXTENDED CHARACTER CLASSES"
.rs
The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])"
@@ -1560,18 +1593,19 @@ character becomes an additional metacharacter within classes, denoting the start
of a nested class, so a literal "[" must be escaped as "\e[".
.P
Secondly, within the UTS#18 extended syntax, there are additional operators
-"||", "&&" and "--" which denote character class union, intersection, and
-subtraction respectively. In standard Perl syntax, these would simply be
-needlessly-repeated literals (except for "-" which can denote a range). These
-operators can be used in constructs such as "[\ep{L}--[QW]]" for "Unicode
-letters, other than Q and W". A literal "-" at the end of a range must be
-escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it
-must be escaped as "[\e--1]" in UTS#18 extended classes).
+"||", "&&", "--" and "~~" which denote character class union, intersection,
+subtraction, and symmetric difference respectively. In standard Perl syntax,
+these would simply be needlessly-repeated literals (except for "-" which can
+denote a range). These operators can be used in constructs such as
+"[\ep{L}--[QW]]" for "Unicode letters, other than Q and W". A literal "-" at
+the end of a range must be escaped (so while "[--1]" in Perl syntax is the
+range from hyphen to "1", it must be escaped as "[\e--1]" in UTS#18 extended
+classes).
.P
The specific rules in PCRE2 are that classes can be nested:
"[...[B]...[^C]...]". The individual class items (literal characters, literal
ranges, properties such as \ed or \ep{...}, and nested classes) can be
-combined by juxtaposition or by an operator "||", "&&", or "--".
+combined by juxtaposition or by an operator "||", "&&", "--", or "~~".
Juxtaposition is the implicit union operator, and binds more tightly than any
explicit operator. Precedence between the explicit operators is not defined,
so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but
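
The operator rules in the PERL EXTENDED CHARACTER CLASSES section above ("!" binds tightest, "&" binds tighter than "|", "+", "-" and "^", and all binary operators are left-associative) can be modelled with a small recursive-descent evaluator. The following is a minimal Python sketch of those rules, not PCRE2's implementation. The atom names (digit, alpha, lower, hex), the evaluate() helper, and the ASCII-only universe are assumptions made for illustration. Tokens must be separated by whitespace here, which is a simplification of the /xx behaviour the section describes.

```python
import string

# Assumption: an ASCII-only universe keeps the demo small; PCRE2 itself
# works over the full code point range.
UNIVERSE = {chr(c) for c in range(128)}

# Hypothetical atom names standing in for real atoms such as \d or [:alpha:].
ATOMS = {
    "digit": set(string.digits),
    "alpha": set(string.ascii_letters),
    "lower": set(string.ascii_lowercase),
    "hex": set(string.hexdigits),
}


def evaluate(expr):
    """Evaluate a whitespace-separated set expression to a set of characters."""
    for sym in "()!":
        expr = expr.replace(sym, f" {sym} ")
    tokens = expr.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def operand():
        # '!' is the only unary operator and binds most tightly.
        tok = take()
        if tok == "!":
            return UNIVERSE - operand()
        if tok == "(":
            value = low()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok not in ATOMS:
            raise ValueError(f"unknown atom: {tok}")
        return set(ATOMS[tok])

    def high():
        # '&' binds more tightly than the other binary operators.
        value = operand()
        while peek() == "&":
            take()
            value = value & operand()
        return value

    def low():
        # '|', '+', '-' and '^' share one level and associate to the left.
        value = high()
        while peek() in ("|", "+", "-", "^"):
            op = take()
            rhs = high()
            if op in ("|", "+"):
                value = value | rhs
            elif op == "-":
                value = value - rhs
            else:
                value = value ^ rhs
        return value

    result = low()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result
```

Under this model, evaluate("alpha - lower & hex") removes only a-f from the letters, because "lower & hex" is grouped first, whereas evaluate("(alpha - lower) & hex") yields just A-F.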
4 changes: 4 additions & 0 deletions src/pcre2.h.generic
@@ -339,6 +339,10 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210
#define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211
#define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213
+#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214
+#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216

/* "Expected" matching error codes: no match and partial match. */

4 changes: 4 additions & 0 deletions src/pcre2.h.in
@@ -339,6 +339,10 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210
#define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211
#define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213
+#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214
+#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216

/* "Expected" matching error codes: no match and partial match. */
