Implement Perl extended character classes #553

Merged
merged 1 commit into from
Nov 15, 2024
8 changes: 5 additions & 3 deletions HACKING
@@ -199,9 +199,11 @@ META_RANGE_ESCAPED hyphen in class range with at least one escape
META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument (see below for with argument)
META_THEN (*THEN) - no argument (see below for with argument)
-META_ECLASS_OR || in an extended character class
-META_ECLASS_AND && in an extended character class
-META_ECLASS_SUB -- in an extended character class
+META_ECLASS_AND && (or &) in an extended character class
+META_ECLASS_OR || (or |, +) in an extended character class
+META_ECLASS_SUB -- (or -) in an extended character class
+META_ECLASS_XOR ~~ (or ^) in an extended character class
+META_ECLASS_NOT ! in an extended character class

The two RANGE values occur only in character classes. They are positioned
between two literals that define the start and end of the range. In an EBCDIC
193 changes: 119 additions & 74 deletions doc/html/pcre2pattern.html

Large diffs are not rendered by default.

133 changes: 86 additions & 47 deletions doc/html/pcre2syntax.html
@@ -24,29 +24,30 @@ <h1>pcre2syntax man page</h1>
<li><a name="TOC9" href="#SEC9">SCRIPT MATCHING WITH \p AND \P</a>
<li><a name="TOC10" href="#SEC10">THE BIDI_CLASS PROPERTY FOR \p AND \P</a>
<li><a name="TOC11" href="#SEC11">CHARACTER CLASSES</a>
-<li><a name="TOC12" href="#SEC12">QUANTIFIERS</a>
-<li><a name="TOC13" href="#SEC13">ANCHORS AND SIMPLE ASSERTIONS</a>
-<li><a name="TOC14" href="#SEC14">REPORTED MATCH POINT SETTING</a>
-<li><a name="TOC15" href="#SEC15">ALTERNATION</a>
-<li><a name="TOC16" href="#SEC16">CAPTURING</a>
-<li><a name="TOC17" href="#SEC17">ATOMIC GROUPS</a>
-<li><a name="TOC18" href="#SEC18">COMMENT</a>
-<li><a name="TOC19" href="#SEC19">OPTION SETTING</a>
-<li><a name="TOC20" href="#SEC20">NEWLINE CONVENTION</a>
-<li><a name="TOC21" href="#SEC21">WHAT \R MATCHES</a>
-<li><a name="TOC22" href="#SEC22">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
-<li><a name="TOC23" href="#SEC23">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
-<li><a name="TOC24" href="#SEC24">SUBSTRING SCAN ASSERTION</a>
-<li><a name="TOC25" href="#SEC25">SCRIPT RUNS</a>
-<li><a name="TOC26" href="#SEC26">BACKREFERENCES</a>
-<li><a name="TOC27" href="#SEC27">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
-<li><a name="TOC28" href="#SEC28">CONDITIONAL PATTERNS</a>
-<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
-<li><a name="TOC30" href="#SEC30">CALLOUTS</a>
-<li><a name="TOC31" href="#SEC31">REPLACEMENT STRINGS</a>
-<li><a name="TOC32" href="#SEC32">SEE ALSO</a>
-<li><a name="TOC33" href="#SEC33">AUTHOR</a>
-<li><a name="TOC34" href="#SEC34">REVISION</a>
+<li><a name="TOC12" href="#SEC12">PERL EXTENDED CHARACTER CLASSES</a>
+<li><a name="TOC13" href="#SEC13">QUANTIFIERS</a>
+<li><a name="TOC14" href="#SEC14">ANCHORS AND SIMPLE ASSERTIONS</a>
+<li><a name="TOC15" href="#SEC15">REPORTED MATCH POINT SETTING</a>
+<li><a name="TOC16" href="#SEC16">ALTERNATION</a>
+<li><a name="TOC17" href="#SEC17">CAPTURING</a>
+<li><a name="TOC18" href="#SEC18">ATOMIC GROUPS</a>
+<li><a name="TOC19" href="#SEC19">COMMENT</a>
+<li><a name="TOC20" href="#SEC20">OPTION SETTING</a>
+<li><a name="TOC21" href="#SEC21">NEWLINE CONVENTION</a>
+<li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
+<li><a name="TOC23" href="#SEC23">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
+<li><a name="TOC24" href="#SEC24">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
+<li><a name="TOC25" href="#SEC25">SUBSTRING SCAN ASSERTION</a>
+<li><a name="TOC26" href="#SEC26">SCRIPT RUNS</a>
+<li><a name="TOC27" href="#SEC27">BACKREFERENCES</a>
+<li><a name="TOC28" href="#SEC28">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
+<li><a name="TOC29" href="#SEC29">CONDITIONAL PATTERNS</a>
+<li><a name="TOC30" href="#SEC30">BACKTRACKING CONTROL</a>
+<li><a name="TOC31" href="#SEC31">CALLOUTS</a>
+<li><a name="TOC32" href="#SEC32">REPLACEMENT STRINGS</a>
+<li><a name="TOC33" href="#SEC33">SEE ALSO</a>
+<li><a name="TOC34" href="#SEC34">AUTHOR</a>
+<li><a name="TOC35" href="#SEC35">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
@@ -311,7 +312,45 @@ <h1>pcre2syntax man page</h1>
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
-<br><a name="SEC12" href="#TOC1">QUANTIFIERS</a><br>
+<P>
+When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be
+used, allowing nested character classes, combined using set operators.
+<pre>
+[x&&[^y]] UTS#18 extended character class
+
+x||y set union (OR)
+x&&y set intersection (AND)
+x--y set difference (AND NOT)
+x~~y set symmetric difference (XOR)
+
+</PRE>
+</P>
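The four UTS#18 operators listed above behave like ordinary set operations on the characters each operand matches. The following is a minimal sketch using Python's built-in `set` type, not PCRE2 itself; the operand sets `x` and `y` and the ASCII-only universe are illustrative assumptions, not part of this change.

```python
# Model a character class as the set of characters it matches.
# Assumption: a small ASCII universe stands in for the full character range.
universe = {chr(c) for c in range(128)}

x = set("abcdef")  # stands in for a class such as [a-f]
y = set("defghi")  # stands in for a class such as [d-i]

union = x | y          # x||y  set union (OR)
intersection = x & y   # x&&y  set intersection (AND)
difference = x - y     # x--y  set difference (AND NOT)
symmetric = x ^ y      # x~~y  set symmetric difference (XOR)

# [x&&[^y]] intersects x with the complement of y, which is the same set as x--y.
not_y = universe - y
assert (x & not_y) == difference
```

Read this way, `[x&&[^y]]` and `[x--y]` describe the same set of characters.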
+<br><a name="SEC12" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
+<P>
+<pre>
+(?[...]) Perl extended character class
+(?[\p{Thai} & \p{Nd}]) operators; whitespace ignored
+(?[(x - y) & z]) parentheses for grouping
+
+(?[ [^3] & \p{Nd} ]) [...] is a nested ordinary class
+(?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...]
+(?[ \d - [3] ]) backslash-escaped set is allowed outside [...]
+(?[ !\n & [:ascii:] ]) backslash-escaped character is allowed outside [...]
+all other characters or ranges must be enclosed in [...]
+
+x|y, x+y set union (OR)
+x&y set intersection (AND)
+x-y set difference (AND NOT)
+x^y set symmetric difference (XOR)
+!x set complement (NOT)
+</pre>
+Inside a Perl extended character class, [...] switches mode to be interpreted
+as an ordinary character class. Outside of a nested [...], the only items
+permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside
+a nested ordinary class, ^ has its usual meaning (inverts the class when used
+as the first character); outside of a nested class, ^ is the XOR operator.
+</P>
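The Perl-style examples in this section can be read as set expressions over the characters each item matches. The following is a rough model in Python, not a call into PCRE2; it assumes an ASCII-only universe, so `\d` and `[:alpha:]` are restricted to ASCII here, and the helper `complement` is a hypothetical name introduced only for this sketch.

```python
import string

# Assumption: ASCII stands in for the full character range.
ascii_chars = {chr(c) for c in range(128)}   # [:ascii:]
digits = set(string.digits)                  # \d (ASCII-only in this sketch)
alpha = set(string.ascii_letters)            # [:alpha:] (ASCII-only in this sketch)


def complement(s):
    """Model !x: every character of the universe that is not in s."""
    return ascii_chars - s


# (?[ \d - [3] ])        digits except 3
digits_except_3 = digits - {"3"}

# (?[ [^3] & \d ])       nested [^3] is a negated ordinary class
not_3_and_digit = complement({"3"}) & digits

# (?[ [:alpha:] - [z] ]) letters except z
alpha_except_z = alpha - {"z"}

# (?[ !\n & [:ascii:] ]) every ASCII character except newline
not_newline = complement({"\n"}) & ascii_chars

# Subtracting 3 and intersecting with "not 3" select the same characters.
assert digits_except_3 == not_3_and_digit
```

The model also reflects the two meanings of `^`: inside a nested `[...]` it negates the class (modelled by `complement`), while between operands it is XOR (Python's `^` on sets).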
+<br><a name="SEC13" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
? 0 or 1, greedy
@@ -335,7 +374,7 @@ <h1>pcre2syntax man page</h1>
{,m}? zero up to m, lazy
</PRE>
</P>
-<br><a name="SEC13" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
+<br><a name="SEC14" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
\b word boundary
@@ -353,7 +392,7 @@ <h1>pcre2syntax man page</h1>
\G first matching position in subject
</PRE>
</P>
-<br><a name="SEC14" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
+<br><a name="SEC15" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P>
<pre>
\K set reported start of match
@@ -363,13 +402,13 @@ <h1>pcre2syntax man page</h1>
option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones.
</P>
-<br><a name="SEC15" href="#TOC1">ALTERNATION</a><br>
+<br><a name="SEC16" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
expr|expr|expr...
</PRE>
</P>
-<br><a name="SEC16" href="#TOC1">CAPTURING</a><br>
+<br><a name="SEC17" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
(...) capture group
@@ -384,20 +423,20 @@ <h1>pcre2syntax man page</h1>
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit.
</P>
-<br><a name="SEC17" href="#TOC1">ATOMIC GROUPS</a><br>
+<br><a name="SEC18" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
(?&#62;...) atomic non-capture group
(*atomic:...) atomic non-capture group
</PRE>
</P>
-<br><a name="SEC18" href="#TOC1">COMMENT</a><br>
+<br><a name="SEC19" href="#TOC1">COMMENT</a><br>
<P>
<pre>
(?#....) comment (not nestable)
</PRE>
</P>
-<br><a name="SEC19" href="#TOC1">OPTION SETTING</a><br>
+<br><a name="SEC20" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
@@ -456,7 +495,7 @@ <h1>pcre2syntax man page</h1>
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
-<br><a name="SEC20" href="#TOC1">NEWLINE CONVENTION</a><br>
+<br><a name="SEC21" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
@@ -469,7 +508,7 @@ <h1>pcre2syntax man page</h1>
(*NUL) the NUL character (binary zero)
</PRE>
</P>
-<br><a name="SEC21" href="#TOC1">WHAT \R MATCHES</a><br>
+<br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
setting with a similar syntax.
@@ -478,7 +517,7 @@ <h1>pcre2syntax man page</h1>
(*BSR_UNICODE) any Unicode newline sequence
</PRE>
</P>
-<br><a name="SEC22" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
+<br><a name="SEC23" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) )
@@ -504,7 +543,7 @@ <h1>pcre2syntax man page</h1>
(ultimate default 255). If every branch matches a fixed number of characters,
the limit for each branch is 65535 characters.
</P>
-<br><a name="SEC23" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
+<br><a name="SEC24" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
@@ -517,7 +556,7 @@ <h1>pcre2syntax man page</h1>
(*non_atomic_positive_lookbehind:...) )
</PRE>
</P>
-<br><a name="SEC24" href="#TOC1">SUBSTRING SCAN ASSERTION</a><br>
+<br><a name="SEC25" href="#TOC1">SUBSTRING SCAN ASSERTION</a><br>
<P>
This feature is not Perl-compatible.
<pre>
@@ -534,7 +573,7 @@ <h1>pcre2syntax man page</h1>

</PRE>
</P>
-<br><a name="SEC25" href="#TOC1">SCRIPT RUNS</a><br>
+<br><a name="SEC26" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
(*script_run:...) ) script run, can be backtracked into
@@ -544,7 +583,7 @@ <h1>pcre2syntax man page</h1>
(*asr:...) )
</PRE>
</P>
-<br><a name="SEC26" href="#TOC1">BACKREFERENCES</a><br>
+<br><a name="SEC27" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
@@ -561,7 +600,7 @@ <h1>pcre2syntax man page</h1>
(?P=name) reference by name (Python)
</PRE>
</P>
-<br><a name="SEC27" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
+<br><a name="SEC28" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
@@ -580,7 +619,7 @@ <h1>pcre2syntax man page</h1>
\g'-n' call subroutine by relative number (PCRE2 extension)
</PRE>
</P>
-<br><a name="SEC28" href="#TOC1">CONDITIONAL PATTERNS</a><br>
+<br><a name="SEC29" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
@@ -603,7 +642,7 @@ <h1>pcre2syntax man page</h1>
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
-<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
+<br><a name="SEC30" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@@ -630,7 +669,7 @@ <h1>pcre2syntax man page</h1>
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
-<br><a name="SEC30" href="#TOC1">CALLOUTS</a><br>
+<br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout (assumed number 0)
@@ -641,7 +680,7 @@ <h1>pcre2syntax man page</h1>
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
-<br><a name="SEC31" href="#TOC1">REPLACEMENT STRINGS</a><br>
+<br><a name="SEC32" href="#TOC1">REPLACEMENT STRINGS</a><br>
<P>
If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for
<b>pcre2_substitute()</b> is not interpreted. Otherwise, by default, the only
@@ -687,12 +726,12 @@ <h1>pcre2syntax man page</h1>
The substitution strings themselves are expanded. Backslash can be used to
escape colons and closing curly brackets.
</P>
-<br><a name="SEC32" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC33" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
-<br><a name="SEC33" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC34" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -701,9 +740,9 @@ <h1>pcre2syntax man page</h1>
Cambridge, England.
<br>
</P>
-<br><a name="SEC34" href="#TOC1">REVISION</a><br>
+<br><a name="SEC35" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 20 October 2024
+Last updated: 08 November 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>