Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] .str.findall returning incorrect results when using a quantifier with a capturing group #16730

Open
JamesMaki opened this issue Sep 3, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@JamesMaki
Copy link

Describe the bug
cuDF .str.findall returns incorrect results with regex pattern that uses quanitifier with a capturing group.

Steps/Code to reproduce bug

reg_ex = r'(\d{4}\s){4}'
test_string = 'TEST12 1111 2222 3333 4444 5555'
import cudf
s = cudf.Series([test_string])
s.str.findall(reg_ex)
# returns 
# 0    [2 , 1 , 2 , 3 , 4 ]
# dtype: list

Note: Without the quantifier, shortening the pattern to just r'(\d{4}\s)', cuDF returns the correct results of [1111 , 2222 , 3333 , 4444 ].

Expected behavior

reg_ex = r'(\d{4}\s){4}'
test_string = 'TEST12 1111 2222 3333 4444 5555'
import re
re.findall(reg_ex, test_string)
# returns
# ['4444 ']

Environment overview (please complete the following information)
Tested in latest NGC Docker image on RTX 5880 Ada and A100 SXM, also confirmed this behavior exists in the latest nightly build.

@JamesMaki JamesMaki added the bug Something isn't working label Sep 3, 2024
@davidwendt davidwendt self-assigned this Sep 5, 2024
@davidwendt
Copy link
Contributor

This has uncovered a couple of issues. First, there is a bug in libcudf when handling nested quantifiers which is addressed in PR #16798
Second, the rules for matching in findall do not match the python definition for re.findall():

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

The current behavior does not consider the existing of capture groups and so this will need to be addressed in a separate PR.
One sticking point is the handling of multiple groups which specifies returning tuples. Since tuples are not a libcudf type, the closest result would be either a flattened list column (consecutive row elements represent the tuple) or a nested list column.

rapids-bot bot pushed a commit that referenced this issue Oct 10, 2024
Fixes the libcudf regex parsing logic when handling nested fixed quantifiers. The logic handles fixed quantifiers by simple repeating the previous instruction. If the previous item is a group (capture or non-capture) that group may also contain an internal fixed quantifier as well.
Found while working on #16730

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16798
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants