Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re, nre have wrong start semantics #14284

Closed
timotheecour opened this issue May 9, 2020 · 5 comments · Fixed by #14483
Closed

re, nre have wrong start semantics #14284

timotheecour opened this issue May 9, 2020 · 5 comments · Fixed by #14483

Comments

@timotheecour
Copy link
Member

timotheecour commented May 9, 2020

re, nre have wrong start semantics; nim-regex has correct semantics

Example

when defined case1: # D20200509T003306:here gitissue
  import unittest
  import re
  proc main()=
    let src = """xabc"""
    let pattern = "^abc$".re
    check match(src, pattern, start = 1) # fails 
    check find(src, pattern, start = 1) == 0 # fails 
    let pattern2 = "abc$".re
    check find(src, pattern2, start = 1) == 0 # fails; it's showing as relative to src[0] instead of src[start]
  main()

when defined case2:
  import nre
  import unittest
  proc main()=
    let src = """xabc"""
    let pattern = "^abc$".re
    let ret = find(src, pattern, start = 1)
    check ret.isSome

    let pattern2 = "abc$".re
    let ret2 = find(src, pattern2, start = 1)
    check ret2.get.matchBounds.a == 0
  main()

when defined case3: # D20200501T000945
  import pkg/regex
  proc main()=
    let src = """xabc"""
    let start = 1
    var m: RegexMatch
    let regx = "^abc$".re
    let ok = find(src, regx, m, start = start)
    doAssert ok # works
  main()

Current Output

nim r -d:case1 main
all checks fail
nim r -d:case2 main
all checks fail

this works:
nim r -d:case3 main

Expected Output

all checks succeed

Possible Solution

  • 1st class openarrays (make openArray first-class RFCs#88) would solve this problem with deprecation but no breaking change: we'd deprecate all procs with start (in re) and with start or end (in nre), and replace all these with procs that take a string slice; there would be no room for guess-work since start would not even part of the API

Additional Information

@andreaferretti
Copy link
Collaborator

Are you saying that you expect ^abc$ to match xabc provided one passes start=1?

The ^ symbol denotes start of string, which clearly is not there. A regex starting with ^ should never match anything by definition, unless start=0. For comparison, python agrees with this interpretation

@timotheecour
Copy link
Member Author

timotheecour commented May 11, 2020

@andreaferretti ok the following example convinces me that you're right, and that start is useful as a way to restrict the search space while preserving the context preceding the start position.
The main point is that passing a substring (or openArray slice, that's irrelevant here) would lose the prefix context, so would have different semantics (and wouldn't be able to express the same things), eg:

when true: # D20200511T013107:here gitissue
  import nre
  proc main()=
    let src = """xabc"""
    let pattern = re"(?<=x|y)abc"
    let ret = find(src, pattern, start = 1)
    echo ret
  main()

Still though, before closing this issue, I think we should

  • clarify semantics of start in the docs of re,nre, and have a runnableExamples illustrating this (and probably change nim-regex (for the 3rd time...) to match that behavior, sorry!)
  • (partly related) mention lookahead, lookbehind (both positive and negative assertions), eg (?<=x|y), in the docs, simply as a runnableExamples is enough

@nitely do you agree?

@nitely
Copy link
Contributor

nitely commented May 11, 2020

The docs say: returns true if s[start..] matches the pattern. It's true Python's regex and PCRE share re behavior. I've not used start often in any language that supports it, and I've used text[start..end] the few times I needed to match a substring, but maybe the current behavior is useful to someone. It's all the same to me.

@andreaferretti
Copy link
Collaborator

start is generally useful in loops, where you consume the string up to a point, do something, then continue matching past that point (possibly skipping some parts, which means it is different from iterating over all matches).

Using s[start..] would work in this context only if substrings are constant time, otherwise we get a quadratic behaviour.

In all this cases, one rarely uses ^, because by definition you are not looking at the beginning of a string, so either behvaiour is fine in my opinion, provided it is documented. The python behvaiour seems more natural to me, and since this is already the behaviour of re, I wouldn't change it

@timotheecour
Copy link
Member Author

timotheecour commented Oct 30, 2020

@andreaferretti so how do we specify we want the match to occur at position i when using start = i, given that ^ would mean position = 0 ?
Note that match is not acceptable, since it also adds a matching end of string constraint ($).

(see also nitely/nim-regex#80)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants