Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex fails expected match #47936

Closed
jjvliu opened this issue Dec 20, 2022 · 18 comments · Fixed by #47940
Closed

Regex fails expected match #47936

jjvliu opened this issue Dec 20, 2022 · 18 comments · Fixed by #47940
Labels
bug Indicates an unexpected problem or unintended behavior regression Regression in behavior compared to a previous version strings "Strings!"

Comments

@jjvliu
Copy link

jjvliu commented Dec 20, 2022

I created a minimal working example for a problem in the regular expression engine. I was able to reproduce the issue on the latest stable release 1.8.3, on 1.9.0-alpha1, and on master built today. It seems to be a problem with the engine failing to backtrack.

julia> match(r"a+[bc]+c", "ababc")

julia> versioninfo()
Julia Version 1.10.0-DEV.156
Commit 9be3c85e49 (2022-12-19 21:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 72 × Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 72 virtual cores
julia> match(r"a+[bc]+c", "ababc")

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 72 × Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 1 on 72 virtual cores
@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 20, 2022

Your regex is matching at least one a, followed by at least one b or c, followed by a single c. To match ababc, you need to add a grouping specifier:

julia> match(r"(a+[bc])+c", "ababc")
RegexMatch("ababc", 1="ab")

@jakobnissen
Copy link
Contributor

jakobnissen commented Dec 20, 2022

Why does it not match the last three characters though? Regex need not match the full string, but should be able to match a substring. For example:

julia> m = match(r"a+[bc]c", "ababc")
RegexMatch("abc")

Here, the last three letters are being matched. Similarly, OP's regex should match the same last three letters.

@jakobnissen jakobnissen reopened this Dec 20, 2022
@martinholters
Copy link
Member

Also:

julia> match(r"a[bc]+c", "ababc")
RegexMatch("abc")

Seems to have regressed in 1.6, btw as with 1.5:

julia> match(r"a+[bc]+c", "ababc")
RegexMatch("abc")

@Seelengrab
Copy link
Contributor

Nice catch, I missed that this was the intention!

@jakobnissen jakobnissen added bug Indicates an unexpected problem or unintended behavior regression Regression in behavior compared to a previous version strings "Strings!" labels Dec 20, 2022
@Seelengrab
Copy link
Contributor

Cursory testing suggests this is due to anchoring - see https://regex101.com/r/YoewW3/1 and toggle the Anchoring flag.

@fredrikekre
Copy link
Member

Change and/or regression in PCRE2 then, it was updated from 10.36 to 10.40 between 1.7 and 1.8 (v1.7.3...v1.8.3#diff-077a0e0d8ff036a0ed8b93361a2c26cfc8ce3d96049890896e25c20d06e07461R97).

@Seelengrab
Copy link
Contributor

Which is weird, since we don't pass the anchoring flag in that codepath, as far as I can tell 🤔

It does work though if we pass in ENDANCHORED directly:

julia> match(r"a+[bc]+c"a, "ababc", 1, Base.PCRE.ANCHORED)

julia> match(r"a+[bc]+c"a, "ababc", 1, Base.PCRE.ENDANCHORED)
RegexMatch("abc")

@martinholters
Copy link
Member

Change and/or regression in PCRE2 then, it was updated from 10.36 to 10.40 between 1.7 and 1.8

Nope, must have been earlier than that:

julia> VERSION
v"1.5.1"

julia> match(r"a+[bc]+c", "ababc")
RegexMatch("abc")
julia> VERSION
v"1.6.7"

julia> match(r"a+[bc]+c", "ababc")

@fredrikekre
Copy link
Member

Works in 1.6.6, and indeed that same version bump of PCRE was backported for 1.6.7 (v1.6.6...v1.6.7#diff-077a0e0d8ff036a0ed8b93361a2c26cfc8ce3d96049890896e25c20d06e07461L94).

@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 20, 2022

The only change in PCRE related to anchoring I can find the in the changelog between 10.36 and 10.40 is this, in 10.38:

https://github.com/PCRE2Project/pcre2/blob/0746b3d523c9a799aa62ae9e03188b23664f48b0/ChangeLog#L338-L342

So we could try with 10.37 to see if 10.38 is to blame, and the current most recent version 10.42 and if that also doesn't work, open an issue upstream?

@barucden
Copy link
Contributor

FWIW, the regex seems to work in pcre2 10.40:

$ pcre2test
PCRE2 version 10.40 2022-04-14
  re> "a+[bc]+c"
data> "ababc"
 0: abc

@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 20, 2022

Well it also doesn't work if I build PCRE2 10.37 from source on a current-ish master, but it DOES work if I build with 10.36. Set USE_BINARYBUILDER_PCRE=0 in Make.user and set PCRE_VER := 10.36 in deps/pcre.version.

So something must have changed in PCRE2 v10.37, with regards to the flags we are passing.

PCRE2Project/pcre2@pcre2-10.36...pcre2-10.37

I can't spot anything that jumps out as being the culprit just from those commits though..

julia> vstr = zeros(UInt8, 32);

julia> using PCRE2_jll
[ Info: Precompiling PCRE2_jll [efcefdf7-47ab-520b-bdef-62a2eaa19f15]

julia> ccall((:pcre2_config_8, libpcre2_8), Cint, (UInt32, Ref{UInt8}), 11, vstr)
17

julia> VersionNumber(split(unsafe_string(pointer(vstr)), " ")[1])
v"10.37.0"

julia> match(r"a+[bc]+c", "ababc")

julia> versioninfo()
Julia Version 1.10.0-DEV.147
Commit 7b10d5fe01* (2022-12-14 19:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24
julia> vstr = zeros(UInt8, 32);

julia> using PCRE2_jll
[ Info: Precompiling PCRE2_jll [efcefdf7-47ab-520b-bdef-62a2eaa19f15]

julia> ccall((:pcre2_config_8, libpcre2_8), Cint, (UInt32, Ref{UInt8}), 11, vstr)
17

julia> VersionNumber(split(unsafe_string(pointer(vstr)), " ")[1])
v"10.36.0"

julia> match(r"a+[bc]+c", "ababc")
RegexMatch("abc")

julia> versioninfo()
Julia Version 1.10.0-DEV.147
Commit 7b10d5fe01* (2022-12-14 19:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24

@giordano
Copy link
Contributor

For what is worth, last week I built pcre2 10.41 and 10.42: https://github.com/JuliaBinaryWrappers/PCRE2_jll.jl. I haven't got around updating it here in Julia, but if anyone wants to do it before me (I don't know when I'd be able to do it in the next few weeks), please go ahead.

@jjvliu
Copy link
Author

jjvliu commented Dec 20, 2022

Cursory testing suggests this is due to anchoring - see https://regex101.com/r/YoewW3/1 and toggle the Anchoring flag.

Based on jakobnissen and martinholters's counterexamples that simply remove a +, the issue is unlikely to be related to anchoring. If I had to guess, it has to do with incorrect branching logic for backtracking after partial matches from the second greedy plus. But I am not sure how fruitful it is to guess what logic broke (rather than check what commit broke in JuliaLang or PCRE2 as in the more recent discussion)—the issue is quite situational. For example, here are some similar cases that do not display the issue:

julia> match(r"a+[bc]{1,2}c", "ababc")
RegexMatch("abc")

julia> match(r"(a)+[bc]+c", "ababc")
RegexMatch("abc", 1="a")

Here are some similar cases that share the issue:

julia> match(r"a{1,2}[bc]+c", "ababc")

julia> match(r"(a+)[bc]+c", "ababc")

@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 20, 2022

As mentioned above, recent master with PCRE2 10.36 works, while PCRE2 10.37 does not, leading me to believe this is an issue with PCRE. To be clear, I also think now that anchoring was a red herring.

Here's what I get for your examples on a local build with 10.42:

julia> ccall((:pcre2_config_8, libpcre2_8), Cint, (UInt32, Ref{UInt8}), 11, vstr)
17

julia> VersionNumber(split(unsafe_string(pointer(vstr)), " ")[1])
v"10.42.0"

julia> match(r"a+[bc]+c", "ababc")
RegexMatch("abc")

julia> match(r"a+[bc]{1,2}c", "ababc")
RegexMatch("abc")

julia> match(r"(a)+[bc]+c", "ababc")
RegexMatch("abc", 1="a")

julia> match(r"a{1,2}[bc]+c", "ababc")
RegexMatch("abc")

julia> match(r"(a+)[bc]+c", "ababc")
RegexMatch("abc", 1="a")

julia> versioninfo()
Julia Version 1.10.0-DEV.147
Commit 7b10d5fe01* (2022-12-14 19:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24

So looks like this really was a bug in PCRE and us bumping the version should fix it in the next release. Should be backportable too. @giordano which places besides deps/pcre.version need updating? I can take a look & open a PR, since I already have it built locally.

@giordano
Copy link
Contributor

which places besides deps/pcre.version need updating? I can take a look & open a PR, since I already have it built locally.

See #47174 (comment)

Seelengrab added a commit to Seelengrab/julia that referenced this issue Dec 20, 2022
@giordano
Copy link
Contributor

giordano commented Dec 20, 2022

I confirm this is fixed by upgrading to PCRE2 10.42 in #47940

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 1.10.0-DEV.148 (2022-12-20)
 _/ |\__'_|_|_|\__'_|  |  bump_pcre/ac77be92b5* (fork: 1 commits, 6 days)
|__/                   |

julia> match(r"a+[bc]+c", "ababc")
RegexMatch("abc")

@StefanKarpinski
Copy link
Member

Yikes. This is a bad regression. Glad we have tests against this now. Chagrinned that we didn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior regression Regression in behavior compared to a previous version strings "Strings!"
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants