[BUG] `split_record` outputs empty list for empty input string #16453
Comments
CC @davidwendt.
The output looks correct to me. It is returning a list with a single empty string and not an empty list.
You can see that this unit test also shows that an empty input string returns a single empty string:
cudf/cpp/tests/strings/split_tests.cpp, lines 246 to 260 in 211dbe4
Note that the last entry is an empty string, and the expected value for that row is `LCW{""}` and not `LCW{}`.
The output seems to be correct in that case, in which the input has other non-empty strings.
cudf/cpp/tests/strings/split_tests.cpp, lines 310 to 318 in 211dbe4
Thank you @ttnghia for investigating this. We certainly have a bug in Is it possible for Spark-RAPIDS to detect the all-empty string column case and avoid calling libcudf?
I think we can work around that in the plugin while waiting for the cudf fix in the next release. Thanks, Greg.
Fixes specialized behavior for an all-empty input column on the strings split APIs.

Verifying behavior with Pandas `str.split( pat, expand, regex )`:
`pat=None -- whitespace`
`expand=False -- record APIs`
`regex=True -- re APIs`

- [x] `split`
- [x] `split` - whitespace
- [x] `rsplit`
- [x] `rsplit` - whitespace
- [x] `split_record`
- [x] `split_record` - whitespace
- [x] `rsplit_record`
- [x] `rsplit_record` - whitespace
- [x] `split_re`
- [x] `rsplit_re`
- [x] `split_record_re`
- [x] `rsplit_record_re`

Closes #16453

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Bradley Dice (https://github.com/bdice)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #16466
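The checklist above treats the explicit-delimiter and whitespace variants separately. A minimal sketch of why that distinction matters for empty input, using only Python's built-in `str.split` (this is not pandas or libcudf code, and it is only an assumption that either library matches it row for row):

```python
# Illustration only: built-in Python str.split, not pandas or libcudf.
inputs = ["", "  ", "a b"]

# Explicit delimiter: a string without the delimiter comes back as a
# one-element list, including the empty string.
explicit = [s.split(" ") for s in ["", "ab"]]

# Whitespace mode (no delimiter argument): runs of whitespace are collapsed
# and empty tokens are dropped, so empty or blank input gives an empty list.
whitespace = [s.split() for s in inputs]

print(explicit)    # [[''], ['ab']]
print(whitespace)  # [[], [], ['a', 'b']]
```

In plain Python the two modes therefore disagree on what an empty string splits into, which is one reason each variant needs its own check against the reference behavior.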
For an empty input string, `split_record` outputs an empty list. This is inconsistent with the case where the input does not contain the split delimiter, in which the output list has a single element equal to the input.
For example:
However:
This inconsistency causes a mismatch between cudf and the Java `String.split` API, which states that:
Ref: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-
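The consistency requested in this issue can be sketched with a small pure-Python reference model. The helper name `split_record_reference` is hypothetical and is not part of cudf; it only models the rule "no delimiter match means one element equal to the input", applied per row. Python's `str.split` differs from Java's `String.split` in other respects (for example, trailing empty tokens), so this is an illustration of the empty-input case only.

```python
from typing import List, Optional


def split_record_reference(
    rows: List[Optional[str]], delimiter: str
) -> List[Optional[List[str]]]:
    """Hypothetical reference model, not the cudf implementation.

    Each non-null row becomes a list of tokens. A row that does not contain
    the delimiter, including an empty string, yields a one-element list
    holding the row itself. Null rows stay null.
    """
    if not delimiter:
        raise ValueError("this sketch requires a non-empty delimiter")
    return [None if s is None else s.split(delimiter) for s in rows]


# Column that mixes empty and non-empty strings.
mixed = split_record_reference(["a,b", "abc", ""], ",")

# Column in which every row is an empty string (the case reported here).
all_empty = split_record_reference(["", ""], ",")

print(mixed)      # [['a', 'b'], ['abc'], ['']]
print(all_empty)  # [[''], ['']]
```

Under this model the all-empty column yields one empty string per row, the same as an empty row inside a mixed column.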