[BUG] `split_record` output empty list for empty input string #16453

ttnghia · 2024-07-31T21:32:48Z

For an empty input string, `split_record` outputs an empty list. This is inconsistent with the case where the input does not contain the split delimiter, in which the output list has one element equal to the input. For example:

input = "abcxyz"
output = split_record(input, "@")
output = ["abcxyz"]

However:

input = ""
output = split_record(input, "@")
output = []

This inconsistency causes a mismatch between cudf and the Java `String.split` API, whose documentation states:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

Ref: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-
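For comparison, Python's built-in `str.split` with an explicit delimiter follows the same one-element rule as the Java documentation quoted above. The following is a minimal sketch in plain Python (no cudf involved); the helper name `split_rows` is hypothetical and only used for illustration:

```python
def split_rows(rows, delimiter):
    """Split each string on an explicit delimiter, returning one list per row.

    Hypothetical helper for illustration; it is not part of cudf or pandas.
    """
    return [row.split(delimiter) for row in rows]


# An input with no delimiter match yields a single-element list,
# and that includes the empty string.
result = split_rows(["abcxyz", "", "a@b"], "@")
print(result)  # [['abcxyz'], [''], ['a', 'b']]

# A column where every row is empty should behave the same way.
all_empty = split_rows(["", ""], "@")
print(all_empty)  # [[''], ['']]
```

This is the per-row result the issue expects from `split_record`: `['']` rather than `[]` for an empty input string.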

ttnghia · 2024-07-31T21:32:57Z

CC @davidwendt.

davidwendt · 2024-08-01T17:27:19Z

The output looks correct to me. It is returning a list with a single empty string and not an empty list.
The input string is empty and so it returns an empty string in the output.

>>> import cudf
>>> import pandas as pd
>>> ps = pd.Series(['','abcdef'])
>>> ps.str.split('@')
0          []
1    [abcdef]
dtype: object
>>> gs = cudf.Series(['','abcdef'])
>>> gs.str.split('@')
0          []
1    [abcdef]
dtype: list

The [] contains an empty string and does not indicate an empty list.

>>> ps.str.split('@')[0]
['']
>>> gs.str.split('@')[0]
['']

This unit test also shows that an empty input string returns a list containing a single empty string:

cudf/cpp/tests/strings/split_tests.cpp

Lines 246 to 260 in 211dbe4

    
           TEST_F(StringsSplitTest, SplitRecord) 
        
           { 
        
             std::vector<char const*> h_strings{" Héllo thesé", nullptr, "are some  ", "tést String", ""}; 
        
             auto validity = 
        
               thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }); 
        
             cudf::test::strings_column_wrapper strings(h_strings.begin(), h_strings.end(), validity); 
        
             auto result = 
        
               cudf::strings::split_record(cudf::strings_column_view(strings), cudf::string_scalar(" ")); 
        
             using LCW = cudf::test::lists_column_wrapper<cudf::string_view>; 
        
             LCW expected( 
        
               {LCW{"", "Héllo", "thesé"}, LCW{}, LCW{"are", "some", "", ""}, LCW{"tést", "String"}, LCW{""}}, 
        
               validity); 
        
             CUDF_TEST_EXPECT_COLUMNS_EQUAL(result->view(), expected); 
        
           }

Note that the last entry is an empty string and the expected value for that row is LCW{""} and not LCW{}

ttnghia · 2024-08-01T17:29:40Z

The output seems to be correct in that case, where the input also contains non-empty strings.
However, for an input column in which every string is empty, each output list is empty:

cudf/cpp/tests/strings/split_tests.cpp

Lines 310 to 318 in 211dbe4

    
           TEST_F(StringsSplitTest, SplitRecordAllEmpty) 
        
           { 
        
             auto input     = cudf::test::strings_column_wrapper({"", "", "", ""}); 
        
             auto sv        = cudf::strings_column_view(input); 
        
             auto delimiter = cudf::string_scalar("s"); 
        
             auto empty     = cudf::string_scalar(""); 
        
             using LCW = cudf::test::lists_column_wrapper<cudf::string_view>; 
        
             LCW expected({LCW{}, LCW{}, LCW{}, LCW{}});

GregoryKimball · 2024-08-01T19:07:49Z

Thank you @ttnghia for investigating this. We certainly have a bug in split and rsplit with an input column of all empty strings. @davidwendt and I discussed this issue and we agree that fixing the root cause will be too involved to make it in time for 24.08. We looked into some edge-case hacks on the libcudf side and couldn't find a working solution that would cover split, split_record, split_re, rsplit, etc. We plan to address this edge-case bug in 24.10.

Is it possible for Spark-RAPIDS to detect the all-empty string column case and avoid calling libcudf?

ttnghia · 2024-08-01T22:02:53Z

I think we can work around that in the plugin while waiting for the cudf fix in the next release. Thanks Greg.
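A caller-side guard of the kind discussed above might look roughly like the following. This is only a sketch in plain Python under stated assumptions: the column is modeled as a list of non-null strings, `backend_split` stands in for the libcudf call, and both `split_record_with_workaround` and `buggy_backend` are hypothetical names. It is not the actual Spark-RAPIDS change.

```python
def buggy_backend(rows, delimiter):
    """Stand-in that mimics the reported bug: all-empty input gives empty lists."""
    if all(row == "" for row in rows):
        return [[] for _ in rows]
    return [row.split(delimiter) for row in rows]


def split_record_with_workaround(rows, delimiter, backend_split):
    """Hypothetical guard: handle the all-empty column without calling the backend."""
    if rows and all(row == "" for row in rows):
        # Each empty string splits into a list holding one empty string.
        return [[""] for _ in rows]
    return backend_split(rows, delimiter)


fixed = split_record_with_workaround(["", "", ""], "@", buggy_backend)
print(fixed)  # [[''], [''], ['']]

mixed = split_record_with_workaround(["", "abc@def"], "@", buggy_backend)
print(mixed)  # [[''], ['abc', 'def']]
```

The guard only intercepts the all-empty case, so columns containing at least one non-empty string still go through the backend unchanged.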

Fixes specialized behavior for all empty input column on the strings split APIs.

Verifying behavior with Pandas `str.split( pat, expand, regex )`

`pat=None -- whitespace`
`expand=False -- record APIs`
`regex=True -- re APIs`

- [x] `split`
- [x] `split` - whitespace
- [x] `rsplit`
- [x] `rsplit` - whitespace
- [x] `split_record`
- [x] `split_record` - whitespace
- [x] `rsplit_record`
- [x] `rsplit_record` - whitespace
- [x] `split_re`
- [x] `rsplit_re`
- [x] `split_record_re`
- [x] `rsplit_record_re`

Closes #16453

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Bradley Dice (https://github.com/bdice)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #16466

ttnghia added the bug Something isn't working label Jul 31, 2024

github-project-automation bot added this to cuDF/Dask/Numba/UCX Jul 31, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Jul 31, 2024

davidwendt self-assigned this Jul 31, 2024

GregoryKimball added this to libcudf Aug 1, 2024

GregoryKimball moved this to Burndown in libcudf Aug 1, 2024

ttnghia mentioned this issue Aug 2, 2024

[BUG] String split APIs on empty string produce incorrect result NVIDIA/spark-rapids#11287

Closed

revans2 mentioned this issue Aug 2, 2024

Add work around for string split with empty input. NVIDIA/spark-rapids#11292

Merged

davidwendt mentioned this issue Aug 12, 2024

Fix all-empty input column for strings split APIs #16466

Merged

15 tasks

rapids-bot bot closed this as completed in #16466 Aug 13, 2024

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Aug 13, 2024

GregoryKimball removed this from libcudf Aug 16, 2024

ttnghia mentioned this issue Aug 21, 2024

[FEA] Revert work-around for string split with empty input NVIDIA/spark-rapids#11374

Closed
