Re-enable dollar ($) line anchor in regular expressions in find mode #5289

NVnavkumar · 2022-04-21T00:40:33Z

Fixes #4533.

This re-enables support for the end of line anchor ($) in regular expressions. A couple of caveats:

- This only enables support in RegexFindMode. A separate issue will need to be filed for Replace and Split modes (if necessary).
- Apache Spark only uses regular expressions in standard (non-multiline) mode, so we only support $ as defined in that mode. The multiline behavior of $ is not supported.
- This code handles how line terminators are managed around the line anchor. The $ has different matching characteristics when it is used next to line terminator characters in the regular expression. In particular:
  - When $ is used with characters other than a line terminator sequence, the transpiled pattern needs to optionally match the possible line terminator sequences (defined in the Line terminators section of the Java Pattern documentation) before the end of the string.
  - When $ is used with line terminator characters, the regular expression may be forced to match a specific line terminator. There are 4 cases to call out here:
    - \r$ - matches only a CR immediately before the end of the string. No transpilation is needed in this case: just match the end of the string using $ in cuDF, and the two patterns are equivalent.
    - [any other line terminator character including \n]$ - matches that line terminator character plus, optionally, any other valid line terminator character before the end of the string.
    - $\n - matches only an LF (newline) before the end of the string, and no other line terminator sequence. This requires the underlying cuDF to support negative lookahead groups, so this case falls back to the CPU. The reason is that \r\n is a valid line terminator sequence, and this pattern must explicitly reject it, which can only be expressed with a negative lookahead group (see cudf#3100 on lookaheads).
    - [any other line character including \r]$ - matches that line terminator character plus, optionally, any other valid line terminator character before the end of the string.
  - Multiple $ in a row (for example $$) are handled by Java as if they were a single $.
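
For reference, the Java behavior described above can be checked with a small standalone program. This is only a sketch of the java.util.regex semantics the transpiler has to reproduce, not code from this PR: the class name `DollarAnchorDemo`, its helper methods, and the `EXPLICIT` pattern are hypothetical, and `EXPLICIT` is just one hand-written way to spell out "optional line terminator, then end of string" for the single pattern `a$`.

```java
import java.util.regex.Pattern;

final class DollarAnchorDemo {
    // Hypothetical hand-written equivalent of "a$" in non-multiline mode:
    // 'a', then at most one line terminator sequence, then the absolute end of input.
    static final String EXPLICIT =
        "a(?:\\r\\n|[\\n\\r\\u0085\\u2028\\u2029])?\\z";

    // True if the regex matches anywhere in the input (default, non-multiline mode).
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same as find, but with Pattern.MULTILINE enabled (Spark does not use this mode).
    static boolean findMultiline(String regex, String input) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(input).find();
    }

    // Make CR and LF visible when printing test inputs.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // "a$" tolerates one trailing line terminator, but not two, and not mid-string.
        String[] inputs = {"a", "a\n", "a\r\n", "a\n\n", "a\nb"};
        for (String in : inputs) {
            System.out.printf("%-8s a$=%b explicit=%b%n",
                escape(in), find("a$", in), find(EXPLICIT, in));
        }

        // Only multiline mode lets $ match before a newline in the middle of the input.
        System.out.println(findMultiline("a$", "a\nb")); // true

        // \r$ accepts a trailing CR but not a trailing CR LF.
        System.out.println(find("\\r$", "a\r"));   // true
        System.out.println(find("\\r$", "a\r\n")); // false

        // $\n accepts a lone trailing LF but rejects CR LF.
        System.out.println(find("$\\n", "a\n"));   // true
        System.out.println(find("$\\n", "a\r\n")); // false

        // Repeated anchors behave like a single anchor.
        System.out.println(find("a$$", "a\n"));    // true
    }
}
```

The `$\n` results show why that case needs a negative lookahead on the GPU side: the pattern has to accept `\n` at the end while refusing the `\n` that belongs to a `\r\n` pair.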

Signed-off-by: Navin Kumar <navink@nvidia.com>

andygrove · 2022-04-22T13:37:23Z

This is looking good. Should we also include form-feed \f in the tests?
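
For context on the form-feed question: Java does not treat \f as a line terminator, so $ will not skip over it the way it skips a trailing \n. The snippet below is a minimal sketch for checking that behavior locally; the class name `FormFeedAnchorDemo` and its helper are hypothetical and are not part of the plugin's test suite.

```java
import java.util.regex.Pattern;

final class FormFeedAnchorDemo {
    // True if the regex matches anywhere in the input (default, non-multiline mode).
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // A trailing form feed is ordinary content, so "a$" does not match.
        System.out.println(find("a$", "a\f"));       // false

        // Matching the form feed explicitly works, with or without a final newline.
        System.out.println(find("a\\f$", "a\f"));    // true
        System.out.println(find("a\\f$", "a\f\n"));  // true
    }
}
```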

NVnavkumar · 2022-04-22T18:08:57Z

build

NVnavkumar · 2022-04-25T18:36:38Z

build

NVnavkumar added 10 commits April 14, 2022 12:10

WIP: dollar anchor support

c2310e0

Signed-off-by: Navin Kumar <navink@nvidia.com>

Solved issue with anchor in character class (treat as literal $)

2857de8

Signed-off-by: Navin Kumar <navink@nvidia.com>

Merge branch 'branch-22.06' of github.com:NVnavkumar/spark-rapids into enable_dollar_anchor_regex

4b330ff

Merge branch 'branch-22.06' of github.com:NVIDIA/spark-rapids into enable_dollar_anchor_regex

f7b3f1a

WIP: more dollar anchor support, a few more cases to tidy up

99abe18

Signed-off-by: Navin Kumar <navink@nvidia.com>

Remove next parameter and fix style issues

9c2879c

Signed-off-by: Navin Kumar <navink@nvidia.com>

Handle most edge cases with line anchor $

40e0746

Signed-off-by: Navin Kumar <navink@nvidia.com>

Finish handling last few edge cases regarding carriage return vs other line termination characters

868a3ff

Signed-off-by: Navin Kumar <navink@nvidia.com>

Merge branch 'branch-22.06' of github.com:NVIDIA/spark-rapids into enable_dollar_anchor_regex

680b7fa

Disable one particular edge case as we don't have the underlying cudf support to transpile. Also, add more comments

d789db5

Signed-off-by: Navin Kumar <navink@nvidia.com>

NVnavkumar requested a review from andygrove April 21, 2022 00:40

NVnavkumar self-assigned this Apr 21, 2022

forgot to add these examples to comments

4f1cd55

Signed-off-by: Navin Kumar <navink@nvidia.com>

sameerz added the feature request label Apr 21, 2022

sameerz added this to the Apr 18 - Apr 29 milestone Apr 21, 2022

NVnavkumar added 2 commits April 22, 2022 10:11

Merge branch 'branch-22.06' of github.com:NVIDIA/spark-rapids into enable_dollar_anchor_regex

62b4eec

add tests including form feed \f which is not technically a line terminator to test whitespace around a line anchor

0200411

Signed-off-by: Navin Kumar <navink@nvidia.com>

NVnavkumar added 2 commits April 25, 2022 10:56

Merge branch 'branch-22.06' of github.com:NVIDIA/spark-rapids into enable_dollar_anchor_regex

93bc564

fix additional regular expression tests that now run on GPU

d187506

Signed-off-by: Navin Kumar <navink@nvidia.com>

NVnavkumar marked this pull request as ready for review April 25, 2022 18:36

andygrove approved these changes Apr 25, 2022

NVnavkumar merged commit cc3af4b into NVIDIA:branch-22.06 Apr 26, 2022

NVnavkumar deleted the enable_dollar_anchor_regex branch April 26, 2022 02:06

anthony-chang mentioned this pull request May 24, 2022

Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines #5610

Merged

3 tasks
