Environment overview (please complete the following information)
Tested in latest NGC Docker image on RTX 5880 Ada and A100 SXM, also confirmed this behavior exists in the latest nightly build.
This has uncovered a couple of issues. First, there is a bug in libcudf when handling nested quantifiers, which is addressed in PR #16798.
Second, the matching rules in findall do not follow the Python definition of re.findall():
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
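As a reference point, the rules quoted above can be checked directly against Python's re module. This is a minimal sketch using a made-up input string and made-up patterns (neither comes from this issue); it only illustrates how the shape of the result changes with the number of capture groups.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "ab12 cd34"

# No groups: each element is the whole match.
no_groups = re.findall(r"[a-z]{2}\d{2}", text)

# Exactly one capture group: each element is that group's text.
one_group = re.findall(r"[a-z]{2}(\d{2})", text)

# Multiple capture groups: each element is a tuple of the groups.
two_groups = re.findall(r"([a-z]{2})(\d{2})", text)

# A non-capturing group does not change the form of the result.
non_capturing = re.findall(r"(?:[a-z]{2})\d{2}", text)

print(no_groups)      # ['ab12', 'cd34']
print(one_group)      # ['12', '34']
print(two_groups)     # [('ab', '12'), ('cd', '34')]
print(non_capturing)  # ['ab12', 'cd34']
```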
The current behavior does not consider the existence of capture groups, so this will need to be addressed in a separate PR.
One sticking point is the handling of multiple groups, for which Python specifies returning tuples. Since tuples are not a libcudf type, the closest result would be either a flattened list column (consecutive row elements represent the tuple) or a nested list column.
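To make the two candidate layouts concrete, here is a rough sketch in plain Python lists rather than libcudf columns. The rows, the pattern, and the helper names flattened and nested are all hypothetical; the sketch shows only the shape each option would give per row, not how libcudf would implement it.

```python
import re

# Hypothetical rows and pattern with two capture groups.
rows = ["ab12 cd34", "ef56"]
pattern = re.compile(r"([a-z]{2})(\d{2})")


def flattened(row):
    """One list per row; consecutive elements together form a tuple."""
    return [group for match in pattern.findall(row) for group in match]


def nested(row):
    """One list per row, holding one inner list per match."""
    return [list(match) for match in pattern.findall(row)]


flat_result = [flattened(r) for r in rows]
nested_result = [nested(r) for r in rows]

print(flat_result)    # [['ab', '12', 'cd', '34'], ['ef', '56']]
print(nested_result)  # [[['ab', '12'], ['cd', '34']], [['ef', '56']]]
```

With the flattened layout, a consumer needs to know the number of capture groups to recover the tuples; the nested layout carries that grouping in the structure itself.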
Fixes the libcudf regex parsing logic when handling nested fixed quantifiers. The logic handles fixed quantifiers by simply repeating the previous instruction. If the previous item is a group (capture or non-capture), that group may itself contain an internal fixed quantifier.
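The idea of repeating the previous item can be illustrated with Python's re module, which is used here only as a reference implementation and says nothing about libcudf internals. In this sketch the input string and both patterns are made up: a group with an inner fixed quantifier is itself quantified, and the hand-expanded form is expected to match the same text.

```python
import re

# Hypothetical input, chosen only for illustration.
text = "11 22 33 44 "

# A non-capturing group with an inner fixed quantifier ({2} on \d),
# followed by an outer fixed quantifier ({2} on the group).
quantified = r"(?:\d{2}\s){2}"

# The same pattern with both quantifiers expanded by repetition.
expanded = r"(?:\d\d\s)(?:\d\d\s)"

quantified_matches = re.findall(quantified, text)
expanded_matches = re.findall(expanded, text)

print(quantified_matches)  # ['11 22 ', '33 44 ']
print(expanded_matches)    # ['11 22 ', '33 44 ']
```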
Found while working on #16730
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #16798
Describe the bug
cuDF .str.findall returns incorrect results with a regex pattern that applies a quantifier to a capturing group.
Steps/Code to reproduce bug
Note: Without the quantifier, shortening the pattern to just r'(\d{4}\s)', cuDF returns the correct results of [1111 , 2222 , 3333 , 4444 ].
Expected behavior