Regex bug: Unicode hex ranges not supported #46137
Comments
In any case, if this is indeed a bug it is a bug in PCRE2 and not Julia. |
Unfortunately, the bug applies to all Unicode ranges, not just ones with invalid characters. Even simply typing julia> r"[\x{00A0}-\x{00A5}]" throws a
Please reopen this issue? |
It is still an error from PCRE, but this one doesn't show up in other environments (e.g. https://regex101.com/) so perhaps there is some compile setting that is different. |
Workaround:
julia> '和'
'和': Unicode U+548C (category Lo: Letter, other)
julia> '平'
'平': Unicode U+5E73 (category Lo: Letter, other)
julia> contains("和", r"[\N{U+548C}-\N{U+5E73}]")
true
julia> contains("平", r"[\N{U+548C}-\N{U+5E73}]")
true
julia> contains("aaa", r"[\N{U+548C}-\N{U+5E73}]")
false
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake) |
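Another way to sidestep the escape parsing, sketched here as an illustration rather than as something proposed in this thread, is to build the character class from the code points themselves so that PCRE2 only ever sees literal characters. The helper name `codepoint_range_regex` is hypothetical, and the sketch assumes neither endpoint is a character that needs escaping inside a class (such as `]`, `\` or `^`).

```julia
# Hypothetical helper: build a regex matching every code point in lo..hi.
# The endpoints are interpolated as literal characters, so no \x{...} or
# \N{U+...} escape has to be parsed by PCRE2.
# Assumption: neither endpoint is special inside a character class.
function codepoint_range_regex(lo::Integer, hi::Integer)
    return Regex("[$(Char(lo))-$(Char(hi))]")
end

# The range requested in the original report: U+00A0 to U+10FFFD.
re = codepoint_range_regex(0x00A0, 0x10FFFD)

occursin(re, "和")    # U+548C lies inside the range
occursin(re, "aaa")   # ASCII letters lie outside it
```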
SOLUTION: The command
is short for
In the latter form, we can play with the compile and match option flags that are passed to the PCRE2 library to specify what flavour of regular-expression behaviour exactly we want. Doing that, I quickly found that dropping the
Now it is time to actually read the PCRE2 documentation:
There we find indeed the answer:
In other words, Julia asks PCRE2 to implement a slightly more JavaScript-compatible version of regular expressions than the more Perl-compatible flavor it would have given us by default. The man page doesn't explicitly say so, but the way I read it,
And it suddenly all makes sense, because I guess that choice in favour of ECMAScript syntax for
So this is clearly not a bug in the PCRE2 C library, but at least an omission in the Julia manual. |
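The explicit-options form described in this comment can be sketched roughly as follows. The flag name is elided above; this sketch assumes it is `PCRE.ALT_BSUX`, which the PCRE2 documentation describes as the option that switches `\U`, `\u` and `\x` to JavaScript-style handling. `Base.PCRE`, `Base.DEFAULT_COMPILER_OPTS` and `Base.DEFAULT_MATCH_OPTS` are unexported internals of Base, so they may differ between Julia versions.

```julia
# The pattern from this thread, kept as a raw string so the backslashes
# reach PCRE2 unchanged.
pattern = raw"[\x{00A0}-\x{00A5}]"

# Assumption: ALT_BSUX is the JavaScript-compatibility flag in question.
# Clearing it leaves the other default compile options untouched; if the
# running Julia version does not set it, the mask is a no-op.
compile_opts = Base.DEFAULT_COMPILER_OPTS & ~Base.PCRE.ALT_BSUX

# Three-argument constructor: pattern, compile options, match options.
re = Regex(pattern, compile_opts, Base.DEFAULT_MATCH_OPTS)

occursin(re, "¥")   # U+00A5 is the upper end of the range
occursin(re, "a")   # outside the range
```

Because this relies on internals, the `\N{U+...}` workaround shown earlier in the thread is probably the safer choice for code that has to run across Julia versions.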
Digging through the commit history of where the choice of JavaScript-compatible
The latter commit was made by @nolta as a “band-aid”. String literals, macro/raw string literals and the resulting differences in quote and backslash escaping clearly had a rather tortuous history in the evolution of Julia. Note that at no point in issue #107 is there any discussion about whether Julia's flavour of PCRE should be more like Perl or more like JavaScript. The choice of the JavaScript variant just happened to cause one error message in one example to disappear, if I understood that discussion correctly. They wanted |
Trying to form Unicode hex ranges in a regular expression causes a LoadError:
yields
The result should be a regex that matches all Unicode codepoints from U+00A0 to U+10FFFD.
Julia version: 1.7.3