Migrate string find operations to pylibcudf #15604
Conversation
Looks great, just some feedback on the tests. In general I think you can find a lot more pyarrow equivalents than you think! Also, be on the lookout for potential fixtures!
target_col = plc.interop.from_arrow(
    pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
)
expected = pa.array([0, 0, 0, 0, 0, 0, None, 0, None, 0], type=pa.int32())
expected = pa.Array.from_pandas(
    string_col.to_pandas().str.rfind(target), type=pa.int32()
)
I'd love to avoid using pandas. I'd rather hardcode an expected output than use pandas for this testing, I think? Another option could be doing something clever like reversing the string (pyarrow does support that) and then doing a find_substring. I'm not sure, WDYT?
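For illustration, a minimal pure-Python sketch of the reverse trick suggested above. In a real test the array-level equivalents would presumably be pyarrow's `utf8_reverse` and `find_substring`; the index arithmetic mapping the reversed position back to the original string is an assumption, checked here against `str.rfind`:

```python
def rfind_via_reverse(s, target):
    """Recover rfind(s, target) from a forward find on the reversed string.

    Pure-Python stand-in for pyarrow's utf8_reverse + find_substring, both
    of which use -1 to signal "not found"; None models a null row.
    """
    if s is None:
        return None
    rev_pos = s[::-1].find(target[::-1])  # forward find on reversed data
    if rev_pos == -1:
        return -1
    # Map the match position in the reversed string back to the original.
    return len(s) - rev_pos - len(target)

# The mapping agrees with str.rfind on a few representative inputs:
for s in ["abcabc", "xyz", "aaa", ""]:
    for t in ["a", "bc", "z"]:
        assert rfind_via_reverse(s, t) == s.rfind(t)
```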
Dropped to python stringops in 6cd8747
expected = pa.Array.from_pandas(
    string_col.to_pandas().str.contains(target)
)
You should be able to check count_substring(...) > 1, roughly.
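As a sketch of that equivalence in pure Python (standing in for pyarrow's `count_substring`): a "contains" mask is a strictly positive substring count, with nulls propagating. Note the comparison here is against zero, which is what the contains semantics require:

```python
def contains_via_count(s, target):
    """Derive a boolean 'contains' result from a substring count,
    mirroring a count_substring-based expected value; None models null."""
    if s is None:
        return None
    return s.count(target) > 0  # a positive count means "contains"

assert contains_via_count("banana", "an") is True
assert contains_via_count("banana", "xy") is False
assert contains_via_count(None, "an") is None
```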
expected = pa.array(
    [False, True, True, True, True, True, None, True, None, True]
)
expected = pa.array(
    [True, True, True, True, True, True, None, True, None, True]
)
Maybe just call starts_with once per row? It's obviously inefficient, but would work and save you from hardcoding the outputs in case we change the input. Same goes for test_ends_with_column.
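A minimal sketch of that per-row approach, using plain Python string ops to build the expected column (helper name and sample data are illustrative, not from the PR); nulls propagate:

```python
def expected_starts_with(strings, targets):
    """Build the expected per-row starts_with result with plain Python,
    so the test keeps working if the shared input data changes."""
    out = []
    for s, t in zip(strings, targets):
        # A null on either side yields a null result, matching column nulls.
        out.append(None if s is None or t is None else s.startswith(t))
    return out

strings = ["Apple", "dog", None, "Rain"]
targets = ["A", "d", "F", None]
assert expected_starts_with(strings, targets) == [True, True, None, None]
```

The same shape works for ends_with by swapping in `s.endswith(t)`.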
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Almost there; I have a couple more suggestions for improving the test suite. Some of them are also about building in the expectations for how tests should look so that we stick to the right patterns going forward.
@pytest.fixture
def find_target_column():
    return plc.interop.from_arrow(
        pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
    )

@pytest.fixture
def contains_target_column():
    return plc.interop.from_arrow(
        pa.array(["a", "d", "F", "j", "m", "q", None, "R", None, "w"])
    )

@pytest.fixture
def starts_with_target_column():
    return plc.interop.from_arrow(
        pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
    )

@pytest.fixture
def ends_with_target_column():
    return plc.interop.from_arrow(
        pa.array(["C", "e", "I", "j", "m", "q", None, "T", None, "w"])
    )
Do these actually need to be different? If it's easy to rewrite the inputs to the tests to use the same fixture, we might as well. It makes patterns easier to spot when there's only a single piece of data used for all the testing.
Concatenated these into a mega fixture
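A rough sketch of what consolidating those four fixtures into one shared fixture could look like (fixture and constant names are hypothetical; in the real tests the data would be wrapped via `plc.interop.from_arrow(pa.array(...))`, which is omitted here to keep the sketch dependency-free):

```python
import pytest

# Hypothetical single target list reused by every string-find test,
# so there is one place to add new rows or edge cases.
TARGET_DATA = ["A", "d", "F", "j", "k", "n", None, "R", None, "u"]

@pytest.fixture(scope="module")
def target_col():
    # Hand back a fresh copy so no test can mutate the shared data.
    return list(TARGET_DATA)
```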
@pytest.mark.parametrize("target", ["a", ""])
def test_rfind(plc_col, target):
    got = plc.strings.find.rfind(
        plc_col, DeviceScalar(target, dtype=np.dtype("object")).c_value, 0, -1
We shouldn't be creating cudf._lib.DeviceScalar objects here. Can we just make a pylibcudf Scalar directly? For now, it's OK if that requires using the interop module. We'll eventually need to add proper constructors, but for now removing usage of cudf from pylibcudf tests is a higher priority.
Also, while you're at it... the scalar could be another fixture :) You can parametrize the fixture directly like this. This would again require ensuring that the same fixture can be used for all of the tests, but the way that you've written the tests I'm fairly certain that they should pass generically for arbitrary input data. This way you can also have a single place where you add new edge cases you want to test (like the empty string above).
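A small sketch of the parametrized-fixture pattern referred to above (names are illustrative); the empty string covers the edge case mentioned earlier, and new targets get added in one place:

```python
import pytest

# Hypothetical single source of scalar targets; "" is the empty-string
# edge case called out in the review.
FIND_TARGETS = ["a", ""]

@pytest.fixture(params=FIND_TARGETS)
def target(request):
    # Every test requesting `target` runs once per entry in FIND_TARGETS.
    return request.param
```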
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
/merge
This PR implements pylibcudf bindings for libcudf's string find.hpp and migrates the existing cuDF Cython to leverage it.
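For reference, the `0, -1` arguments in the rfind call above suggest (begin, end) search bounds, where -1 appears to mean "through the end of the string". A pure-Python mirror of those assumed semantics (the authoritative definition lives in find.hpp):

```python
def rfind_in_range(s, target, begin=0, end=-1):
    """Mirror the (begin, end) arguments seen in the rfind test call.

    Assumed semantics: end == -1 means search through the end of the
    string; None models a null row. Not the real pylibcudf API.
    """
    if s is None:
        return None
    stop = len(s) if end == -1 else end
    return s.rfind(target, begin, stop)

assert rfind_in_range("abcabc", "a") == 3        # whole string
assert rfind_in_range("abcabc", "a", 0, 2) == 0  # bounded search
assert rfind_in_range(None, "a") is None         # null propagates
```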