[BUG] slice_strings producing incorrect results for some input strings #16768

Closed
jlowe opened this issue Sep 6, 2024 · 1 comment
Labels: bug, libcudf, Spark, strings

jlowe commented Sep 6, 2024

Describe the bug
After #16574, cudf::strings::slice_strings produces incorrect results for some input strings. See the repro steps below for details.

Steps/Code to reproduce bug
Apply the following patch and run gtests/STRINGS_TEST. The added test fails but will pass if #16574 is reverted.

diff --git a/cpp/tests/strings/slice_tests.cpp b/cpp/tests/strings/slice_tests.cpp
index 52e439bd93..bcadf5a7d4 100644
--- a/cpp/tests/strings/slice_tests.cpp
+++ b/cpp/tests/strings/slice_tests.cpp
@@ -50,6 +50,26 @@ TEST_F(StringsSliceTest, Substring)
   CUDF_TEST_EXPECT_COLUMNS_EQUAL(*results, expected);
 }
 
+TEST_F(StringsSliceTest, Substring_Failure)
+{
+  cudf::test::fixed_width_column_wrapper<uint8_t> char_data{
+    0x20,0x13,0x54,0x64,0xc3,0xb2,0x57,0x26,0x70,0x64,0xc2,0x93,0x67,0x00,0xc3,0xb4,0xc2,0xa4,0xc3,0xa3,0x5d,0x37,0xc2,0x80,0x2b,0x3a,0xc2,0xb9,0x73,0x77,0xc2,0xb7,0x16,0xc2,0xa9,0x3a,0xc3,0xa1,0x2c,0x60,0x70,0x2c,0x20,0xc2,0x80,0x56,0x58,0xc3,0xa2,0xc3,0xaf,0x0a,0xc2,0xa9,0x06,0x1e,0xc3,0xae,0xc3,0xb1,0x71,0xc2,0x8f,0x7f,0x42,0xc2,0x8b,0xc2,0x9b,0x24,0x6e,0x17,0xc3,0x98,0xc2,0x98,0xc3,0x9d,0xc3,0x9e,0xc2,0xbc,0xc2,0xb0,0x14,0x4d,0xc2,0xa1,0xc3,0x9c,0x2c,0x20,0xc2,0xb2,0xc2,0x9a,0x6a,0xc3,0x84,0xc2,0xaf,0x72,0xc2,0xb6,0xc3,0x86,0x24,0x1c,0x08,0x1f,0x07,0x4f,0x7b,0x15,0xc2,0xb5,0xc3,0xb9,0xc3,0x84,0x46,0xc2,0xa7,0x19,0x37,0xc3,0x96,0xc3,0xac,0xc2,0x88,0x0f,0xc3,0xac,0xc3,0x95,0x2e,0x2c,0x20,0xc3,0x8a,0xc2,0x90,0x2b,0xc2,0x9f,0xc2,0xac,0x6e,0xc2,0xa0,0xc3,0xaa,0xc3,0xa3,0xc3,0x94,0x6f,0x7e,0xc3,0xa8,0x7d,0x3a,0xc3,0xbd,0xc3,0x99,0x57,0x11,0x63,0x6b,0xc3,0x91,0x15,0xc3,0xb4,0x33,0x5e,0x62,0xc2,0x8e,0xc2,0x99,0xc3,0x9c,0x2c,0x20,0x1d,0x47,0xc2,0x8e,0x27,0xc2,0x98,0xc2,0x82,0xc2,0x93,0x37,0xc2,0xa2,0xc3,0x97,0xc3,0xa0,0xc3,0xad,0xc2,0x94,0x08,0x39,0xc3,0x94,0xc3,0xad,0xc2,0xbc,0x2e,0x51,0x2d,0x73,0x79,0xc2,0x89,0x1a,0xc3,0x9d,0xc2,0x8b,0xc2,0xad,0xc2,0x89,0x01,0x2c,0x20,0x59,0x0d,0xc3,0x99,0x56,0xc2,0x85,0xc3,0x84,0xc2,0x88,0x58,0x25,0xc3,0x91,0x38,0xc2,0x9f,0xc3,0x9b,0x00,0xc3,0xa2,0xc2,0xbd,0xc2,0x82,0xc2,0x80,0xc2,0xa7,0xc2,0xbb,0x72,0xc2,0xb4,0xc3,0x8d,0x12,0x48,0xc2,0x80,0xc3,0x8c,0xc3,0xbe,0xc3,0x95,0x1d,0x2c,0x20,0x1a,0x45,0xc2,0xbd,0xc2,0xb9,0xc2,0x8a,0x6b,0xc3,0x8a,0xc2,0x9a,0x79,0xc2,0x98,0x3a,0xc2,0xb5,0x50,0x73,0xc2,0xb3,0x71,0xc2,0x85,0xc2,0xbf,0x72,0x1b,0xc2,0x95,0x47,0xc3,0x84,0xc3,0xb2,0xc3,0xbd,0x36,0xc2,0xb1,0xc3,0xab,0xc3,0x9e,0x74,
+    0x20,0xc3,0x8e,0xc2,0x8b,0xc3,0xbe,0xc3,0xba,0xc3,0x93,0xc3,0x95,0x61,0x5c,0xc2,0xa4,0x4b,0x70,0x1f,0xc3,0x96,0x03,0x4c,0x72,0x0e,0xc3,0xbe,0x1b,0xc3,0x8c,0xc2,0x82,0xc2,0x8f,0xc3,0xa2,0xc2,0x85,0x16,0xc3,0x82,0xc3,0x97,0x1c,0x28,0xc2,0xac};
+  cudf::test::fixed_width_column_wrapper<int32_t> offsets{0, 334, 382};
+  cudf::column_view strs{
+    cudf::data_type{cudf::type_id::STRING},
+    2,
+    cudf::column_view(char_data).data<uint8_t>(),
+    nullptr,
+    0,
+    0,
+    {offsets}
+  };
+  int start = 0;
+  auto results = cudf::strings::slice_strings(strs, start);
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*results, strs);
+}
+
 class Parameters : public StringsSliceTest, public testing::WithParamInterface<cudf::size_type> {};
 
 TEST_P(Parameters, Substring)

The failure looks like this. Note that the last character of the second input string was mangled in the output.

/cudf/cpp/tests/utilities/column_utilities.cu:562: Failure
Failed
first difference: lhs[1] =  ÎþúÓÕa\¤KpÖLrþâ
Â×(�, rhs[1] =  ÎþúÓÕa\¤KpÖLrþâ
Â×(¬
Google Test trace:
/cudf/cpp/tests/strings/slice_tests.cpp:70:  <--  line of failure

Apologies for the unfriendly random input data; this is what was captured from a failing RAPIDS Spark integration test. Oddly, narrowing the input to just the failing string does not reproduce the failure.
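
For readers trying to interpret the failure output above: the second input string ends with the bytes 0xc2,0xac, which is the two-byte UTF-8 encoding of U+00AC (¬). A replacement character (�) in that position is what a terminal typically shows when a multi-byte sequence is incomplete or corrupted. The following is a minimal standalone sketch of that byte-level reasoning. It does not use libcudf and does not claim to identify the actual cause of the bug. The helper names (utf8_sequence_length, has_complete_sequences, decode_two_byte) are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Length in bytes of the UTF-8 sequence introduced by lead byte `b`.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
constexpr std::size_t utf8_sequence_length(std::uint8_t b)
{
  if (b < 0x80) return 1;            // 0xxxxxxx: single-byte (ASCII range)
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Structural check only: every lead byte must be followed by all of its
// continuation bytes within [data, data + size). Overlong encodings and
// surrogates are NOT rejected; this is not a full UTF-8 validator.
constexpr bool has_complete_sequences(const std::uint8_t* data, std::size_t size)
{
  std::size_t pos = 0;
  while (pos < size) {
    const std::size_t len = utf8_sequence_length(data[pos]);
    if (len == 0) return false;          // stray continuation or invalid byte
    if (pos + len > size) return false;  // sequence cut off at the end
    for (std::size_t i = 1; i < len; ++i) {
      if ((data[pos + i] & 0xC0) != 0x80) return false;  // not 10xxxxxx
    }
    pos += len;
  }
  return true;
}

// Decode a two-byte sequence (110xxxxx 10yyyyyy) into its code point.
constexpr std::uint32_t decode_two_byte(std::uint8_t lead, std::uint8_t cont)
{
  return (static_cast<std::uint32_t>(lead & 0x1F) << 6) |
         static_cast<std::uint32_t>(cont & 0x3F);
}

// Last four bytes of the second string in the patch above.
constexpr std::uint8_t tail[] = {0x1c, 0x28, 0xc2, 0xac};
```

With all four bytes, has_complete_sequences(tail, 4) holds and decode_two_byte(0xc2, 0xac) yields 0xAC. With only the first three bytes, the check fails because the 0xc2 lead byte is left without its continuation byte, which is consistent with (though not proof of) the final byte being lost or altered in the output.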

Expected behavior
Test passes as it did before.

jlowe added the bug, libcudf, Spark, and strings labels Sep 6, 2024
davidwendt self-assigned this Sep 8, 2024

davidwendt commented

Closed by #16777
