[BUG] Minimize false positives when falling back to CPU for end of line/string anchors and newlines #5659

Closed
anthony-chang opened this issue May 26, 2022 · 1 comment · Fixed by #8081
Labels
bug Something isn't working

Comments

@anthony-chang

Describe the bug
The changes in #5610 add very broad checks for patterns in which the end-of-line anchor $ appears immediately next to a newline, a beginning-of-line anchor ^, or a repetition that could produce empty results. However, these checks produce false positives:

  1. $ and ^ inside character classes are treated as anchors, e.g. the pattern 0*[D$3] has no string anchors but will still fall back to CPU.
  2. We check the entire AST node before or after a $, so something like \na$ is allowed, but (\na)$ will fall back to CPU. We should instead check only the component in the group that is closest to the $.
  3. We fall back to CPU for patterns such as ^([a-z]*)([0-9]*)([a-z]*)$, which contain no line terminator characters.
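
To illustrate the first point, here is a minimal sketch of a scan that only reports a $ as an anchor when it is unescaped and sits outside a character class. The object `AnchorScan` and the method `hasEndAnchor` are hypothetical names for illustration and are not part of the plugin; the sketch also assumes no nested character classes and no \Q...\E quoting.

```scala
object AnchorScan {
  /** True if `pattern` contains an unescaped `$` outside any character class.
    * Hypothetical helper; ignores nested classes and \Q...\E quoting. */
  def hasEndAnchor(pattern: String): Boolean = {
    var i = 0
    var inClass = false
    while (i < pattern.length) {
      pattern.charAt(i) match {
        case '\\' => i += 1                      // skip the escaped character
        case '[' if !inClass => inClass = true   // entering a character class
        case ']' if inClass => inClass = false   // leaving a character class
        case '$' if !inClass => return true      // a real end-of-line anchor
        case _ =>                                // any other character
      }
      i += 1
    }
    false
  }
}
```

Under these assumptions, `AnchorScan.hasEndAnchor("0*[D$3]")` is false, while both `\na$` and `(\na)$` are reported as containing an anchor, so the grouping alone does not change the result.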

Steps/Code to reproduce bug
Test with the above examples.

Expected behavior
We should support these cases on GPU, and the results should be consistent with CPU.
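
As a rough reference for what "consistent with CPU" means for the third pattern, the sketch below evaluates it directly with java.util.regex, which is what Spark's CPU regexp_extract is built on. The object `CpuReference`, the method `extract`, and the sample inputs are assumptions for illustration; returning an empty string on no match is assumed to mirror regexp_extract.

```scala
import java.util.regex.Pattern

object CpuReference {
  // Third pattern from the list above: no line terminator characters.
  private val pattern = Pattern.compile("^([a-z]*)([0-9]*)([a-z]*)$")

  /** Returns the requested group, or "" when the pattern does not match
    * (assumed to mirror regexp_extract on CPU). */
  def extract(input: String, group: Int): String = {
    val m = pattern.matcher(input)
    if (m.find()) Option(m.group(group)).getOrElse("") else ""
  }
}
```

For the hypothetical input "abc123def", groups 1, 2, and 3 are "abc", "123", and "def". Note that in Java, without MULTILINE, $ also matches just before a final line terminator, so the input "abc\n" still matches with group 1 equal to "abc"; this is the kind of behavior the GPU path has to reproduce.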

Environment details (please complete the following information)
N/A

Additional context
None.

@andygrove

Examples in RegularExpressionSuite that we want to support on GPU:

  // https://github.com/NVIDIA/spark-rapids/issues/5659
  testGpuFallback("String regexp_extract regex 1",
    "ProjectExec", extractStrings, conf = conf,
    execsAllowedNonGpu = Seq("ProjectExec", "ShuffleExchangeExec")) {
    frame => frame.selectExpr("regexp_extract(strings, '^([a-z]*)([0-9]*)([a-z]*)$', 1)")
  }

  // https://github.com/NVIDIA/spark-rapids/issues/5659
  testGpuFallback("String regexp_extract regex 2",
    "ProjectExec", extractStrings, conf = conf,
    execsAllowedNonGpu = Seq("ProjectExec", "ShuffleExchangeExec")) {
    frame => frame.selectExpr("regexp_extract(strings, '^([a-z]*)([0-9]*)([a-z]*)$', 2)")
  }
