Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support regex lookarounds #3100

Closed
beckernick opened this issue Mar 6, 2019 · 6 comments
Closed

[FEA] Support regex lookarounds #3100

beckernick opened this issue Mar 6, 2019 · 6 comments
Labels
feature request New feature or request strings strings issues (C++ and Python) wontfix This will not be worked on

Comments

@beckernick
Copy link
Member

Description

As a user, I would like to be able to include positive and negative lookaheads / lookbehinds in my regular expressions.

Example Behavior

As an example of a negative lookahead, see: https://regex101.com/r/0275Hq/1

Given

qate
quit
qatar

The regex pattern q(?!u) should match the q in qate and in qatar, but not in quit. In words, "Find the q, but only if it's not followed by a u."

For a nice description of lookarounds, see: https://www.regular-expressions.info/lookaround.html

@fynv
Copy link

fynv commented Mar 7, 2019

As a designing choice, we currently use a NFA engine for regex matching.
The main benefit of NFA engine is to avoid recursive calls, which is crucial for the stability of a GPU program. It is also the most efficient algorithm in most cases. (Russ Cox has some articles about the algorithm https://swtch.com/~rsc/regexp/)
However, NFA engines have some inherent limitations. AFAIK, features like backreferences and lookarounds are not possible to be supported (without breaking the structure of the NFA engine).
Seems Russ Cox did not state this explicitly, but he chose not to implement these features in his own engines, including Re2: https://github.com/google/re2/wiki/WhyRE2
Unless there's opposite information, I suggest we consider this issue not fixable.

@beckernick
Copy link
Member Author

Thanks for linking those pages, @fynv. Will give some of the articles a read to better understand.

@randerzander
Copy link
Contributor

@fynv @davidwendt I'm confused the "features like backreferences lookarounds are not possible to be supported" statement.

rapidsai/custrings#94 suggests backreferences are already supported, at least in some cases. Can you comment?

@fynv
Copy link

fynv commented Mar 18, 2019

Sorry for the confusion. The "backreferences" in rapidsai/custrings#94 refers to the '\1' '\2' elements in the replace expression ('repl') of nvstrings.replace_with_backrefs(), not in the regular expression, although it does have to work with a regular expression.

def replace_with_backrefs(self, pat, repl)

Here 'pat' cannot include '\1', '\2' while 'repl' can include them.

@galipremsagar
Copy link
Contributor

I have a similar issue with my use case where I'm getting around by doing a workaround. So I would like to know which of all strings don't start with a word(say censor). Using negative lookahead we could straight away retrieve the strings that don't start with censor.

import nvstrings
s = nvstrings.to_device(["censor", "test", "one two", "another string", "censor string "])
print(s.match('^(?!censor).*$'))

Output:

[False, False, False, False, False]

But the expected output is:

[False, True, True, True, False]

Since negative lookaheads aren't supported my work-around code is to match strings that start with word censor and negate the resultant boolean array.

import nvstrings
import numpy
s = nvstrings.to_device(["censor", "test", "one two", "another string", "censor string "])
print(~numpy.array(s.match('^(censor).*$')))

Output:

[False  True  True  True False]

cc: @beckernick

@GregoryKimball
Copy link
Contributor

I'm closing this as "wontfix" because lookarounds are not compatible with the current NFA regex engine. We would need a new engine to support lookarounds. If there is more demand for lookarounds, we recommend dispatching to a CPU regex engine - something that will become much faster in next gen hardware such as the Grace Hopper Superchip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request strings strings issues (C++ and Python) wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

6 participants