Unicode regex negated case-insensitivity in 5.14.0-RC1 #11301
From @khwilliamson
This is a bug report for perl from public@khwilliamson.com,
/^[^\x00-\x1f\x7f-\xff :]+:/ works very counterintuitively, because the case fold of \xdf is 'ss',
There has been extensive discussion on p5p beginning with:
Flags:
Site configuration information for perl 5.14.0:
Configured by khw at Tue May 3 08:46:17 MDT 2011.
Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
Locally applied patches:
@INC for perl 5.14.0:
/home/khw/blead/lib/perl5/site_perl/5.14.0/i686-linux-thread-multi-64int-ld
Environment for perl 5.14.0:
PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
|
From @khwilliamson
Attached are the commits for review for fixing the code and tests for
|
From @khwilliamson
0001-embed.fnc-Allow-NULL-arg-to-to_utf8_case.patch
From 84805f7b746dd40931deca68602035305760e7f5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 3 May 2011 09:52:49 -0600
Subject: [PATCH 1/4] embed.fnc: Allow NULL arg to to_utf8_case()
Code within the function doesn't assume that the parameter is non-null,
and in fact the specials are retrieved by swash_init(). Having the
parameter null just means that no specials will be retrieved in the
current call.
---
embed.fnc | 2 +-
proto.h | 5 ++---
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/embed.fnc b/embed.fnc
index b891b43..288dacd 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1318,7 +1318,7 @@ EsMR |HV* |invlist_union |NN HV* const a|NN HV* const b
Ap |void |taint_env
Ap |void |taint_proper |NULLOK const char* f|NN const char *const s
Apd |UV |to_utf8_case |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp \
- |NN SV **swashp|NN const char *normal|NN const char *special
+ |NN SV **swashp|NN const char *normal|NULLOK const char *special
Apd |UV |to_utf8_lower |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
Apd |UV |to_utf8_upper |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
Apd |UV |to_utf8_title |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
diff --git a/proto.h b/proto.h
index a8c066a..0553531 100644
--- a/proto.h
+++ b/proto.h
@@ -4253,10 +4253,9 @@ PERL_CALLCONV UV Perl_to_utf8_case(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, S
__attribute__nonnull__(pTHX_1)
__attribute__nonnull__(pTHX_2)
__attribute__nonnull__(pTHX_4)
- __attribute__nonnull__(pTHX_5)
- __attribute__nonnull__(pTHX_6);
+ __attribute__nonnull__(pTHX_5);
#define PERL_ARGS_ASSERT_TO_UTF8_CASE \
- assert(p); assert(ustrp); assert(swashp); assert(normal); assert(special)
+ assert(p); assert(ustrp); assert(swashp); assert(normal)
PERL_CALLCONV UV Perl_to_utf8_fold(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp)
__attribute__nonnull__(pTHX_1)
--
1.7.1
|
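The effect of relaxing `special` from NN to NULLOK can be pictured with a small stand-in. This is only a sketch: `tables_consulted` is a hypothetical helper invented for illustration, not part of Perl, and it merely counts which mapping tables a lookup would touch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Perl's to_utf8_case(): reports how many mapping
 * tables a lookup would consult. `normal` is required, `special` is
 * optional, mirroring the NN / NULLOK annotations in the patch. */
int
tables_consulted(const char *normal, const char *special)
{
    int consulted = 0;

    assert(normal != NULL);     /* still a hard requirement */
    if (special != NULL)
        consulted++;            /* specials are retrieved only when named */
    consulted++;                /* the normal table is always consulted */
    return consulted;
}
```

Under these assumptions, `tables_consulted("ToFold", "utf8::ToSpecFold")` gives 2 and `tables_consulted("ToFold", NULL)` gives 1, which is the "no specials will be retrieved in the current call" case the commit message describes.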
From @khwilliamson
0002-utf8.c-Add-_flags-version-of-to_utf8_fold.patch
From 2bb975892d97642cd620aa9a51c268131ba66681 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 3 May 2011 10:12:00 -0600
Subject: [PATCH 2/4] utf8.c: Add _flags version of to_utf8_fold()
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
---
embed.fnc | 6 ++++--
embed.h | 4 ++--
global.sym | 4 ++--
proto.h | 24 ++++++++++++++++--------
utf8.c | 19 ++++++++++++-------
utf8.h | 3 +++
6 files changed, 39 insertions(+), 21 deletions(-)
diff --git a/embed.fnc b/embed.fnc
index 288dacd..65116ad 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -562,7 +562,8 @@ ApPR |bool |is_uni_xdigit |UV c
Ap |UV |to_uni_upper |UV c|NN U8 *p|NN STRLEN *lenp
Ap |UV |to_uni_title |UV c|NN U8 *p|NN STRLEN *lenp
Ap |UV |to_uni_lower |UV c|NN U8 *p|NN STRLEN *lenp
-Ap |UV |to_uni_fold |UV c|NN U8 *p|NN STRLEN *lenp
+Amp |UV |to_uni_fold |UV c|NN U8 *p|NN STRLEN *lenp
+AMp |UV |_to_uni_fold_flags|UV c|NN U8 *p|NN STRLEN *lenp|U8 flags
ApPR |bool |is_uni_alnum_lc|UV c
ApPR |bool |is_uni_idfirst_lc|UV c
ApPR |bool |is_uni_alpha_lc|UV c
@@ -1322,7 +1323,8 @@ Apd |UV |to_utf8_case |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp \
Apd |UV |to_utf8_lower |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
Apd |UV |to_utf8_upper |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
Apd |UV |to_utf8_title |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
-Apd |UV |to_utf8_fold |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
+Ampd |UV |to_utf8_fold |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
+AMp |UV |_to_utf8_fold_flags|NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp|U8 flags
#if defined(UNLINK_ALL_VERSIONS)
Ap |I32 |unlnk |NN const char* f
#endif
diff --git a/embed.h b/embed.h
index 89c4fa8..9ff6440 100644
--- a/embed.h
+++ b/embed.h
@@ -27,6 +27,8 @@
/* Hide global symbols */
#define Gv_AMupdate(a,b) Perl_Gv_AMupdate(aTHX_ a,b)
+#define _to_uni_fold_flags(a,b,c,d) Perl__to_uni_fold_flags(aTHX_ a,b,c,d)
+#define _to_utf8_fold_flags(a,b,c,d) Perl__to_utf8_fold_flags(aTHX_ a,b,c,d)
#define amagic_call(a,b,c,d) Perl_amagic_call(aTHX_ a,b,c,d)
#define amagic_deref_call(a,b) Perl_amagic_deref_call(aTHX_ a,b)
#define apply_attrs_string(a,b,c,d) Perl_apply_attrs_string(aTHX_ a,b,c,d)
@@ -623,7 +625,6 @@
#define taint_env() Perl_taint_env(aTHX)
#define taint_proper(a,b) Perl_taint_proper(aTHX_ a,b)
#define tmps_grow(a) Perl_tmps_grow(aTHX_ a)
-#define to_uni_fold(a,b,c) Perl_to_uni_fold(aTHX_ a,b,c)
#define to_uni_lower(a,b,c) Perl_to_uni_lower(aTHX_ a,b,c)
#define to_uni_lower_lc(a) Perl_to_uni_lower_lc(aTHX_ a)
#define to_uni_title(a,b,c) Perl_to_uni_title(aTHX_ a,b,c)
@@ -631,7 +632,6 @@
#define to_uni_upper(a,b,c) Perl_to_uni_upper(aTHX_ a,b,c)
#define to_uni_upper_lc(a) Perl_to_uni_upper_lc(aTHX_ a)
#define to_utf8_case(a,b,c,d,e,f) Perl_to_utf8_case(aTHX_ a,b,c,d,e,f)
-#define to_utf8_fold(a,b,c) Perl_to_utf8_fold(aTHX_ a,b,c)
#define to_utf8_lower(a,b,c) Perl_to_utf8_lower(aTHX_ a,b,c)
#define to_utf8_title(a,b,c) Perl_to_utf8_title(aTHX_ a,b,c)
#define to_utf8_upper(a,b,c) Perl_to_utf8_upper(aTHX_ a,b,c)
diff --git a/global.sym b/global.sym
index dde11d4..89fb825 100644
--- a/global.sym
+++ b/global.sym
@@ -21,6 +21,8 @@ Perl__append_range_to_invlist
Perl__new_invlist
Perl__swash_inversion_hash
Perl__swash_to_invlist
+Perl__to_uni_fold_flags
+Perl__to_utf8_fold_flags
Perl_amagic_call
Perl_amagic_deref_call
Perl_apply_attrs_string
@@ -732,7 +734,6 @@ Perl_sys_term
Perl_taint_env
Perl_taint_proper
Perl_tmps_grow
-Perl_to_uni_fold
Perl_to_uni_lower
Perl_to_uni_lower_lc
Perl_to_uni_title
@@ -740,7 +741,6 @@ Perl_to_uni_title_lc
Perl_to_uni_upper
Perl_to_uni_upper_lc
Perl_to_utf8_case
-Perl_to_utf8_fold
Perl_to_utf8_lower
Perl_to_utf8_title
Perl_to_utf8_upper
diff --git a/proto.h b/proto.h
index 0553531..c83fd12 100644
--- a/proto.h
+++ b/proto.h
@@ -43,6 +43,18 @@ PERL_CALLCONV HV* Perl__swash_to_invlist(pTHX_ SV* const swash)
#define PERL_ARGS_ASSERT__SWASH_TO_INVLIST \
assert(swash)
+PERL_CALLCONV UV Perl__to_uni_fold_flags(pTHX_ UV c, U8 *p, STRLEN *lenp, U8 flags)
+ __attribute__nonnull__(pTHX_2)
+ __attribute__nonnull__(pTHX_3);
+#define PERL_ARGS_ASSERT__TO_UNI_FOLD_FLAGS \
+ assert(p); assert(lenp)
+
+PERL_CALLCONV UV Perl__to_utf8_fold_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, U8 flags)
+ __attribute__nonnull__(pTHX_1)
+ __attribute__nonnull__(pTHX_2);
+#define PERL_ARGS_ASSERT__TO_UTF8_FOLD_FLAGS \
+ assert(p); assert(ustrp)
+
PERL_CALLCONV PADOFFSET Perl_allocmy(pTHX_ const char *const name, const STRLEN len, const U32 flags)
__attribute__nonnull__(pTHX_1);
#define PERL_ARGS_ASSERT_ALLOCMY \
@@ -4213,11 +4225,9 @@ PERL_CALLCONV OP * Perl_tied_method(pTHX_ const char *const methname, SV **sp, S
assert(methname); assert(sp); assert(sv); assert(mg)
PERL_CALLCONV void Perl_tmps_grow(pTHX_ I32 n);
-PERL_CALLCONV UV Perl_to_uni_fold(pTHX_ UV c, U8 *p, STRLEN *lenp)
+/* PERL_CALLCONV UV Perl_to_uni_fold(pTHX_ UV c, U8 *p, STRLEN *lenp)
__attribute__nonnull__(pTHX_2)
- __attribute__nonnull__(pTHX_3);
-#define PERL_ARGS_ASSERT_TO_UNI_FOLD \
- assert(p); assert(lenp)
+ __attribute__nonnull__(pTHX_3); */
PERL_CALLCONV UV Perl_to_uni_lower(pTHX_ UV c, U8 *p, STRLEN *lenp)
__attribute__nonnull__(pTHX_2)
@@ -4257,11 +4267,9 @@ PERL_CALLCONV UV Perl_to_utf8_case(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, S
#define PERL_ARGS_ASSERT_TO_UTF8_CASE \
assert(p); assert(ustrp); assert(swashp); assert(normal)
-PERL_CALLCONV UV Perl_to_utf8_fold(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp)
+/* PERL_CALLCONV UV Perl_to_utf8_fold(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp)
__attribute__nonnull__(pTHX_1)
- __attribute__nonnull__(pTHX_2);
-#define PERL_ARGS_ASSERT_TO_UTF8_FOLD \
- assert(p); assert(ustrp)
+ __attribute__nonnull__(pTHX_2); */
PERL_CALLCONV UV Perl_to_utf8_lower(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp)
__attribute__nonnull__(pTHX_1)
diff --git a/utf8.c b/utf8.c
index 9c2061d..11c2fa4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -1341,12 +1341,12 @@ Perl_to_uni_lower(pTHX_ UV c, U8* p, STRLEN *lenp)
}
UV
-Perl_to_uni_fold(pTHX_ UV c, U8* p, STRLEN *lenp)
+Perl__to_uni_fold_flags(pTHX_ UV c, U8* p, STRLEN *lenp, U8 flags)
{
- PERL_ARGS_ASSERT_TO_UNI_FOLD;
+ PERL_ARGS_ASSERT__TO_UNI_FOLD_FLAGS;
uvchr_to_utf8(p, c);
- return to_utf8_fold(p, p, lenp);
+ return _to_utf8_fold_flags(p, p, lenp, flags);
}
/* for now these all assume no locale info available for Unicode > 255 */
@@ -1799,7 +1799,7 @@ of the result.
The "swashp" is a pointer to the swash to use.
-Both the special and normal mappings are stored lib/unicore/To/Foo.pl,
+Both the special and normal mappings are stored in lib/unicore/To/Foo.pl,
and loaded by SWASHNEW, using lib/utf8_heavy.pl. The special (usually,
but not always, a multicharacter mapping), is tried first.
@@ -2026,15 +2026,20 @@ The first character of the foldcased version is returned
=cut */
+/* Not currently externally documented is 'flags', which currently is non-zero
+ * if full case folds are to be used; otherwise simple folds */
+
UV
-Perl_to_utf8_fold(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp)
+Perl__to_utf8_fold_flags(pTHX_ const U8 *p, U8* ustrp, STRLEN *lenp, U8 flags)
{
+ const char *specials = (flags) ? "utf8::ToSpecFold" : NULL;
+
dVAR;
- PERL_ARGS_ASSERT_TO_UTF8_FOLD;
+ PERL_ARGS_ASSERT__TO_UTF8_FOLD_FLAGS;
return Perl_to_utf8_case(aTHX_ p, ustrp, lenp,
- &PL_utf8_tofold, "ToFold", "utf8::ToSpecFold");
+ &PL_utf8_tofold, "ToFold", specials);
}
/* Note:
diff --git a/utf8.h b/utf8.h
index a08ba04..c40fb58 100644
--- a/utf8.h
+++ b/utf8.h
@@ -16,6 +16,9 @@
# define USE_UTF8_IN_NAMES (PL_hints & HINT_UTF8)
#endif
+#define to_uni_fold(c, p, lenp) _to_uni_fold_flags(c, p, lenp, 1)
+#define to_utf8_fold(c, p, lenp) _to_utf8_fold_flags(c, p, lenp, 1)
+
/* Source backward compatibility. */
#define uvuni_to_utf8(d, uv) uvuni_to_utf8_flags(d, uv, 0)
#define is_utf8_string_loc(s, len, ep) is_utf8_string_loclen(s, len, ep, 0)
--
1.7.1
|
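For readers who want to see what "simple" versus "full" folds means in practice, here is a rough sketch. The names `toy_fold_flags`, `toy_fold`, `TOY_FOLD_SIMPLE` and `TOY_FOLD_FULL` are invented for illustration and cover only a handful of code points; this is not the code in utf8.c. The fold values used for U+00DF and U+1E9E follow Unicode's case folding data.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FOLD_SIMPLE 0   /* hypothetical flag values, for illustration */
#define TOY_FOLD_FULL   1

/* Toy fold: writes the folded code points to out[] (room for 3) and
 * returns how many were written. Not Perl's _to_uni_fold_flags(). */
size_t
toy_fold_flags(unsigned long cp, unsigned long out[3], int flags)
{
    assert(out != NULL);

    if (flags != TOY_FOLD_SIMPLE) {           /* full folds requested */
        if (cp == 0xDF || cp == 0x1E9E) {     /* sharp s, capital sharp s */
            out[0] = 's';
            out[1] = 's';
            return 2;
        }
    }
    else if (cp == 0x1E9E) {                  /* simple fold of U+1E9E */
        out[0] = 0xDF;
        return 1;
    }
    if (cp >= 'A' && cp <= 'Z') {             /* ASCII upper to lower */
        out[0] = cp - 'A' + 'a';
        return 1;
    }
    out[0] = cp;                              /* folds to itself */
    return 1;
}

/* Old name kept as a macro that always asks for full folds, in the same
 * spirit as the to_uni_fold() macro added to utf8.h. */
#define toy_fold(cp, out) toy_fold_flags((cp), (out), TOY_FOLD_FULL)
```

With this sketch, U+00DF folds to the two code points `s`, `s` under the full flag and to itself under the simple flag, while U+1E9E folds to U+00DF under the simple flag. Callers of the old name keep getting full folds without any source change.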
From @khwilliamson
0003-PATCH-perl-89750-Unicode-regex-negated-case-insensit.patch
From b5e0a1ac3277e2f51c551c5eff98bca12e8f0546 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 3 May 2011 11:44:28 -0600
Subject: [PATCH 3/4] PATCH: [perl #89750]: Unicode regex negated case-insensitivity
This patch causes inverted [bracketed] character classes to not handle
multi-character folds. The reason is that these can lead to very
counter-intuitive results (see bug discussion).
In an inverted character class, only single-char folds are now
generated. However the fold for \xDF=>ss is hard-coded in,
and it was too much trouble sending flags to the sub-sub routine that
does this, so another check is done at the point of storing the list of
multi-char folds. Since \xDF doesn't have a single char fold, this
works.
---
regcomp.c | 22 +++++++++++++++++++++-
t/re/fold_grind.t | 2 ++
t/re/re_tests | 5 +++++
3 files changed, 28 insertions(+), 1 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 0858841..59397a2 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -9552,6 +9552,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, U32 depth)
IV namedclass;
char *rangebegin = NULL;
bool need_class = 0;
+ bool allow_full_fold = TRUE; /* Assume wants multi-char folding */
SV *listsv = NULL;
STRLEN initial_listsv_len = 0; /* Kind of a kludge to see if it is more
than just initialized. */
@@ -9608,6 +9609,16 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, U32 depth)
RExC_parse++;
if (!SIZE_ONLY)
ANYOF_FLAGS(ret) |= ANYOF_INVERT;
+
+ /* We have decided to not allow multi-char folds in inverted character
+ * classes, due to the confusion that can happen, even with classes
+ * that are designed for a non-Unicode world: You have the peculiar
+ * case that:
+ "s s" =~ /^[^\xDF]+$/i => Y
+ "ss" =~ /^[^\xDF]+$/i => N
+ *
+ * See [perl #89750] */
+ allow_full_fold = FALSE;
}
if (SIZE_ONLY) {
@@ -10136,7 +10147,8 @@ parseit:
/* Get its fold */
U8 foldbuf[UTF8_MAXBYTES_CASE+1];
STRLEN foldlen;
- const UV f = to_uni_fold(j, foldbuf, &foldlen);
+ const UV f =
+ _to_uni_fold_flags(j, foldbuf, &foldlen, allow_full_fold);
if (foldlen > (STRLEN)UNISKIP(f)) {
@@ -10437,10 +10449,18 @@ parseit:
* used later (regexec.c:S_reginclass()). */
av_store(av, 0, listsv);
av_store(av, 1, NULL);
+
+ /* Store any computed multi-char folds only if we are allowing
+ * them */
+ if (allow_full_fold) {
av_store(av, 2, MUTABLE_SV(unicode_alternate));
if (unicode_alternate) { /* This node is variable length */
OP(ret) = ANYOFV;
}
+ }
+ else {
+ av_store(av, 2, NULL);
+ }
rv = newRV_noinc(MUTABLE_SV(av));
n = add_data(pRExC_state, 1, "s");
RExC_rxi->data->data[n] = (void*)rv;
diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index 82ca6ad..460d296 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -452,6 +452,8 @@ foreach my $test (sort { numerically } keys %tests) {
foreach my $bracketed (0, 1) { # Put rhs in [...], or not
foreach my $inverted (0,1) {
next if $inverted && ! $bracketed; # inversion only valid in [^...]
+ next if $inverted && @target != 1; # [perl #89750] multi-char
+ # not valid in [^...]
# In some cases, add an extra character that doesn't fold, and
# looks ok in the output.
diff --git a/t/re/re_tests b/t/re/re_tests
index 9d5341b..35a7220 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1517,4 +1517,9 @@ abc\N{def - c - \\N{NAME} must be resolved by the lexer
/s/aia S y $& S
/(?aia:s)/ \x{17F} n - -
/(?aia:s)/ S y $& S
+
+# Normally 1E9E generates a multi-char fold, but not in inverted class;
+# See [perl #89750]. This makes sure that the simple fold gets generated
+# in that case, to DF.
+/[^\x{1E9E}]/i \x{DF} n - -
# vim: softtabstop=0 noexpandtab
--
1.7.1
|
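The "s s" versus "ss" oddity quoted in the regcomp.c comment can be reproduced with a toy model. This is a sketch of the semantics described there, not of the regex engine: `toy_sharp_s_class` and `toy_match_negated` are hypothetical names, input is treated as Latin-1 bytes, and the pattern is fixed to /^[^\xDF]+$/i.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes consumed if the NON-inverted class [\xDF] matches at s
 * under /i, or 0 if it does not. With full folds allowed, the class also
 * matches the two-character sequence "ss" in any case. */
static size_t
toy_sharp_s_class(const unsigned char *s, size_t len, int allow_full_fold)
{
    if (len >= 1 && s[0] == 0xDF)
        return 1;
    if (allow_full_fold && len >= 2
        && (s[0] == 's' || s[0] == 'S')
        && (s[1] == 's' || s[1] == 'S'))
        return 2;
    return 0;
}

/* Toy model of /^[^\xDF]+$/i: the inverted class accepts a position only
 * if the non-inverted class fails there. Returns 1 on match, 0 otherwise. */
int
toy_match_negated(const char *str, int allow_full_fold)
{
    const unsigned char *s;
    size_t len, i;

    assert(str != NULL);
    s = (const unsigned char *)str;
    len = strlen(str);

    if (len == 0)
        return 0;                       /* '+' needs at least one char */
    for (i = 0; i < len; i++) {
        if (toy_sharp_s_class(s + i, len - i, allow_full_fold))
            return 0;
    }
    return 1;
}
```

With `allow_full_fold` set to 1, the model accepts `"s s"` but rejects `"ss"`, matching the Y/N pair in the comment. With it set to 0, as the patch does for inverted classes, `"ss"` is accepted and only a literal `\xDF` is rejected.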
From @khwilliamson
0004-regcomp.c-White-space-only.patch
From 96ab5121598b71bdfa033c120c130fcb9bd0e586 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 3 May 2011 11:47:50 -0600
Subject: [PATCH 4/4] regcomp.c: White space only
A previous commit added an 'if' around this code. This now indents
the block properly.
---
regcomp.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 59397a2..1094789 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -10453,10 +10453,10 @@ parseit:
/* Store any computed multi-char folds only if we are allowing
* them */
if (allow_full_fold) {
- av_store(av, 2, MUTABLE_SV(unicode_alternate));
- if (unicode_alternate) { /* This node is variable length */
- OP(ret) = ANYOFV;
- }
+ av_store(av, 2, MUTABLE_SV(unicode_alternate));
+ if (unicode_alternate) { /* This node is variable length */
+ OP(ret) = ANYOFV;
+ }
}
else {
av_store(av, 2, NULL);
--
1.7.1
|
From @khwilliamson
Attached
|
From @khwilliamson
0005-Doc-changes-for-perl-89750.patch
From 94b15be5e73d15b66e96674b06f786485ceaa01b Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 3 May 2011 14:08:43 -0600
Subject: [PATCH 5/5] Doc changes for [perl #89750]
---
pod/perldelta.pod | 25 +++++++++++++++++++++++++
pod/perlre.pod | 6 +++++-
pod/perlrecharclass.pod | 32 +++++++++++++++++++++++++++++---
3 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 817e84f..6855f98 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -56,6 +56,8 @@ This release provides full functionality for C<use feature
'unicode_strings'>. Under its scope, all string operations executed and
regular expressions compiled (even if executed outside its scope) have
Unicode semantics. See L<feature/"the 'unicode_strings' feature">.
+However, see L</Inverted bracketed character classes and multi-character folds>,
+below.
This feature avoids most forms of the "Unicode Bug" (see
L<perlunicode/The "Unicode Bug"> for details). If there is any
@@ -529,6 +531,29 @@ In addition to the sections that follow, see L</C API Changes>.
=head2 Regular Expressions and String Escapes
+=head3 Inverted bracketed character classes and multi-character folds
+
+Some characters match a sequence of two or three characters in C</i>
+regular expression matching under Unicode rules. One example is
+C<LATIN SMALL LETTER SHARP S> which matches the sequence C<ss>.
+
+ 'ss' =~ /\A[\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
+
+This, however, can lead to very counter-intuitive results, especially
+when inverted. Because of this, Perl 5.14 does not use multi-character C</i>
+matching in inverted character classes.
+
+ 'ss' =~ /\A[^\N{LATIN SMALL LETTER SHARP S}]+\z/i # ???
+
+This should match any sequences of characters that aren't the C<SHARP S>
+nor what C<SHARP S> matches under C</i>. C<"s"> isn't C<SHARP S>, but
+Unicode says that C<"ss"> is what C<SHARP S> matches under C</i>. So
+which one "wins"? Do you fail the match because the string has C<ss> or
+accept it because it has an C<s> followed by another C<s>?
+
+Earlier releases of Perl did allow this multi-character matching,
+but due to bugs, it mostly did not work.
+
=head3 \400-\777
In certain circumstances, C<\400>-C<\777> in regexes have behaved
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 12617e2..c4ec417 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -72,7 +72,11 @@ are split between groupings, or when one or more are quantified. Thus
# be even if it did!!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
-Also, this matching doesn't fully conform to the current Unicode
+Perl doesn't match multiple characters in an inverted bracketed
+character class, which otherwise could be highly confusing. See
+L<perlrecharclass/Negation>.
+
+Also, Perl matching doesn't fully conform to the current Unicode C</i>
recommendations, which ask that the matching be made upon the NFD
(Normalization Form Decomposed) of the text. However, Unicode is
in the process of reconsidering and revising their recommendations.
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 4c91931..2b76dfb 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -401,7 +401,7 @@ The third form of character class you can use in Perl regular expressions
is the bracketed character class. In its simplest form, it lists the characters
that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other
-character classes, exactly one character is matched. To match
+character classes, exactly one character is matched.* To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a L<quantifier|perlre/Quantifiers>. For
instance, C<[aeiou]+> matches one or more lowercase English vowels.
@@ -417,6 +417,19 @@ Examples:
# a single character.
"ae" =~ /^[aeiou]+$/ # Match, due to the quantifier.
+ -------
+
+* There is an exception to a bracketed character class matching a only a
+single character. When the class is to match caselessely under C</i>
+matching rules, and a character inside the class matches a
+multiple-character sequence caselessly under Unicode rules, the class
+(when not L<inverted|/Negation>) will also match that sequence. For
+example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
+should match the sequence C<ss> under C</i> rules. Thus,
+
+ 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches
+ 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches
+
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
@@ -525,13 +538,26 @@ It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches any character that is not a
lowercase ASCII letter, which therefore includes almost a hundred thousand
-Unicode letters.
+Unicode letters. The class is said to be "negated" or "inverted".
This syntax make the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
the caret as one of the characters to match, either escape the caret or
else not list it first.
+In inverted bracketed character classes, Perl ignores the Unicode rules
+that normally say that a given character matches a sequence of multiple
+characters under caseless C</i> matching, which otherwise could be
+highly confusing:
+
+ "ss" =~ /^[^\xDF]+$/ui;
+
+This should match any sequences of characters that aren't C<\xDF> nor
+what C<\xDF> matches under C</i>. C<"s"> isn't C<\xDF>, but Unicode
+says that C<"ss"> is what C<\xDF> matches under C</i>. So which one
+"wins"? Do you fail the match because the string has C<ss> or accept it
+because it has an C<s> followed by another C<s>?
+
Examples:
"e" =~ /[^aeiou]/ # No match, the 'e' is listed.
@@ -765,7 +791,7 @@ C<\p{HorizSpace}> and \C<\p{XPosixBlank}>. For example,
C<\p{PosixAlpha}> can be written as C<\p{Alpha}>. All are listed
in L<perluniprops/Properties accessible through \p{} and \P{}>.
-=head4 Negation
+=head4 Negation of POSIX character classes
X<character class, negation>
A Perl extension to the POSIX character class is the ability to
--
1.7.1
|
From tchrist@perl.com
+ /* We have decided to not allow multi-char folds in inverted character
That's a very good observation, although I might have said not "even with classes designed for a non-Unicode world" but rather "especially with classes designed for a non-Unicode world"
Because I don't believe that charclasses in a non-Unicode world
Someday we will have a way for "." to match "\X" -- which is trivial,
--tom
|
The RT System itself - Status changed from 'new' to 'open' |
From tchrist@perl.com
Karl Williamson <public@khwilliamson.com> wrote
+* There is an exception to a bracketed character class matching a only a
Either of these works, although the second is stronger:
matching only a single character.
matching a single character only.
--tom
|
From @khwilliamson
On 05/03/2011 03:19 PM, Tom Christiansen wrote:
Stolen from George Greer
I'll change the wording in 5.15
I don't think so. |
From @khwilliamson
Fixed by 827f5bb
|
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#89750 (status was 'resolved')
Searchable as RT89750$