Regex bug: Unicode hex ranges not supported #46137

Open
ProvocaTeach opened this issue Jul 22, 2022 · 6 comments
Labels: docs, strings, unicode

Comments

@ProvocaTeach
Contributor

ProvocaTeach commented Jul 22, 2022

Trying to form Unicode hex ranges in a regular expression causes a LoadError:

julia> r"[\x{00A0}-\x{10FFFD}]"

yields

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[45]:1

The result should be a regex that matches all Unicode codepoints from U+00A0 to U+10FFFD.
Julia version: 1.7.3

@fredrikekre
Member

man pcre says they have to be valid Unicode code points, but that range has a bunch of invalid ones:

julia> count(x -> !isvalid(Char(x)), 0x00A0:0x10FFFD)
2048

In any case, if this is indeed a bug it is a bug in PCRE2 and not Julia.
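
For reference, the 2048 invalid values counted above should be exactly the UTF-16 surrogate code points U+D800 to U+DFFF, which are not valid Unicode scalar values. A quick check, as a sketch (the variable name `invalid` is only for illustration):

```julia
# Collect every code point in the reported range that Julia considers invalid.
invalid = [x for x in 0x00A0:0x10FFFD if !isvalid(Char(x))]

# Number of invalid code points found in the range.
println(length(invalid))

# Check whether they form one contiguous block, the surrogate range.
println(invalid == collect(0xD800:0xDFFF))
```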

@ProvocaTeach
Contributor Author

ProvocaTeach commented Jul 24, 2022

Unfortunately, the bug applies to all Unicode ranges, not just ones with invalid characters. Even simply typing

julia> r"[\x{00A0}-\x{00A5}]"

throws a LoadError:

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[94]:1

Please reopen this issue?

@fredrikekre
Member

It is still an error from PCRE, but this one doesn't show up in other environments (e.g. https://regex101.com/) so perhaps there is some compile setting that is different.

@fredrikekre fredrikekre reopened this Jul 24, 2022
@inkydragon
Member

Workaround: \N{U+XXXX}

The escape sequence \N{U+} is recognized as another way of specifying a Unicode character by code point in a UTF mode.
https://www.pcre.org/current/doc/html/pcre2unicode.html

julia> '和'
'和': Unicode U+548C (category Lo: Letter, other)

julia> '平'
'平': Unicode U+5E73 (category Lo: Letter, other)

julia> contains("和", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("平", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("aaa", r"[\N{U+548C}-\N{U+5E73}]")
false

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
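
Applied to the small range from the earlier report, the same workaround might look like the sketch below. The variable names `re`, `in_range` and `out_range` are hypothetical, and this assumes a Julia build whose bundled PCRE2 accepts `\N{U+XXXX}`, as the session above does.

```julia
# Same character class as r"[\x{00A0}-\x{00A5}]", written with \N{U+XXXX}.
re = r"[\N{U+00A0}-\N{U+00A5}]"

in_range  = occursin(re, "\u00a3")   # POUND SIGN, U+00A3, inside the range
out_range = occursin(re, "abc")      # plain ASCII, below U+00A0
println((in_range, out_range))
```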

@inkydragon inkydragon added the external dependencies Involves LLVM, OpenBLAS, or other linked libraries label Jul 31, 2022
@mgkuhn mgkuhn added docs This change adds or pertains to documentation unicode Related to unicode characters and encodings strings "Strings!" and removed external dependencies Involves LLVM, OpenBLAS, or other linked libraries labels Aug 3, 2022
@mgkuhn
Contributor

mgkuhn commented Aug 3, 2022

SOLUTION: The command

julia> r"[\x{00A0}-\x{10FFFD}]"

is short for

julia> using Base.PCRE
julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
             PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP,
             PCRE.NO_UTF_CHECK)
ERROR: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] top-level scope
   @ REPL[4]:1

In the latter form, we can play with the compile and match option flags that are passed to the PCRE2 library to specify exactly what flavour of regular-expression behaviour we want.

Doing that, I quickly found that dropping the PCRE.ALT_BSUX compile option suppresses this compilation error:

julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
                    PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP,
                    PCRE.NO_UTF_CHECK)
Regex("[\\x{00A0}-\\x{10FFFD}]",0x040a0000)
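
As a sketch of how such a regex could then be used, here is the same construction for the smaller range U+00A0 to U+00A5. The names `perl_style`, `hit` and `miss` are hypothetical; the flags are the ones listed above minus PCRE.ALT_BSUX.

```julia
using Base.PCRE

# Same compile flags as shown above for r"...", minus PCRE.ALT_BSUX,
# so \x{...} is parsed the Perl way (a code point in braces).
perl_style = Regex(raw"[\x{00A0}-\x{00A5}]",
                   PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP,
                   PCRE.NO_UTF_CHECK)

hit  = occursin(perl_style, "\u00a5")  # YEN SIGN, U+00A5, the upper bound
miss = occursin(perl_style, "x{}")     # literal x, {, } are not in the class
println((hit, miss))
```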

Now it is time to actually read the PCRE2 documentation:

man pcre2
man pcre2api

There we find indeed the answer:

         PCRE2_ALT_BSUX

       This  option  request  alternative  handling of three escape sequences,
       which makes PCRE2's behaviour more like  ECMAscript  (aka  JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a com‐
       pile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal  digits,  in  which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time  error  (Perl
       uses it to upper case the following character).

       (3)  \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal  number  defines  the
       code  point  to  match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAscript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option  (see  "Extra  compile  op‐
       tions" below).  Note that this alternative escape handling applies only
       to patterns. Neither of these options affects  the  processing  of  re‐
       placement strings passed to pcre2_substitute().

In other words, Julia asks PCRE2 to implement a slightly more JavaScript-compatible version of regular expressions than the Perl-compatible flavor it would give us by default. The man page doesn't say so explicitly, but the way I read it, \x{xxxx} is not part of that ECMAscript-style syntax: since \x is not followed by two hexadecimal digits, it matches a literal x, so \x{xxxx} is treated the same as just x{xxxx}. You therefore get the same error with

julia> r"[x{00A0}-x{10FFFD}]"
ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 9

And it suddenly all makes sense, because }-x is indeed an out-of-order range.
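
One way to probe this reading without triggering a compile error is a class with no hyphen in it. This is only a sketch under the interpretation above (default r"..." flags including PCRE.ALT_BSUX); the variable names are hypothetical.

```julia
# With ALT_BSUX, "\x{" is not \x plus two hex digits, so \x should match a
# literal "x" and the class members become x, {, 4, 1 and }.
cls = r"^[\x{41}]$"

matches_A     = occursin(cls, "A")        # would be true with Perl-style \x{41}
matches_brace = occursin(cls, "{")        # "{" is an ordinary class member here
two_digit     = occursin(r"^\x41$", "A")  # \x plus exactly two hex digits is a code point
println((matches_A, matches_brace, two_digit))
```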

I guess that choice in favour of ECMAscript syntax for \u, \U and \x deserves to be examined, justified, and documented. (Ideally, I think the Julia manual should contain a self-contained reference for the regular-expression syntax it supports.)

So this is clearly not a bug in the PCRE2 C library, but at least an omission in the Julia manual.

@mgkuhn
Contributor

mgkuhn commented Aug 3, 2022

Digging through the commit history of where the choice of JavaScript-compatible \x\u\U in Julia regular expressions via PCRE.ALT_BSUX came from:

  • afa1404 in Jan 2015 replaced PCRE compile option PCRE.JAVASCRIPT_COMPAT with PCRE2 option PCRE.ALT_BSUX while upgrading from PCRE to PCRE2, i.e. this seems to be just adjusting to the new API
  • 7909e3d in Mar 2013 added PCRE.JAVASCRIPT_COMPAT to “fix r"\u2220" bug mentioned in #107” (issue #107: “make S"..." and "..." throw errors identically”)

The latter commit was made by @nolta as a “band-aid”.

String literals, macro/raw string literals and the resulting differences in quote and backslash escaping clearly had a rather tortuous history in the evolution of Julia. Note that at no point in issue #107 is there any discussion about whether Julia's flavour of PCRE should be more like Perl or more like JavaScript. The choice of the JavaScript variant just happened to cause one error message in one example to disappear, if I understood that discussion correctly.

They wanted match(r"\u2200", "\u2200") to match, whereas in Perl-compatible regular-expression syntax it would have had to be match(r"\x{2200}", "\u2200"), because in Perl RE, \u means “uppercase the next letter”. Note that in this example, the first \u is interpreted by PCRE2, whereas the second is part of Julia's string literal syntax. The two syntaxes are not the same; they just happen to overlap in this particular example, whereas a slight variant such as match(r"\U102200", "\U102200") does not match.
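
The overlap and the mismatch can be checked side by side. A minimal sketch, assuming the default r"..." flags (including PCRE.ALT_BSUX) and a PCRE2 build that accepts \N{U+...}, the workaround mentioned earlier in this thread; all variable names are hypothetical.

```julia
# \u followed by four hex digits: handled by PCRE2 (ALT_BSUX) in the pattern
# and by Julia's string-literal syntax in the subject string.
u_match = match(r"\u2200", "\u2200") !== nothing

# \U is not a code-point escape under ALT_BSUX; it matches a literal "U",
# so the pattern looks for the seven characters U102200.
U_match   = match(r"\U102200", "\U102200") !== nothing
U_literal = occursin(r"\U102200", "U102200")

# \N{U+...} names the code point directly, independent of ALT_BSUX.
N_match = occursin(r"\N{U+102200}", "\U102200")

println((u_match, U_match, U_literal, N_match))
```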
