`cudf-polars` string/numeric casting #17076

brandon-b-miller · 2024-10-14T16:05:34Z

Depends on #16991
Part of #17060

Implements cross casting from string <-> numeric types in cudf-polars

Matt711

I left some questions. I also think the PR needs the latest changes from branch-24.12.

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

brandon-b-miller · 2024-10-17T14:56:43Z

@wence- in this case polars is testing for a specific exception when a string can't be cast to an integer. We're able to raise an exception in the right place on the cudf-polars side, but because it ends up wrapped in a polars.exceptions.ComputeError, the test's assertion on the exception fails in the end.

FAILED py-polars/tests/unit/sql/test_cast.py::test_cast_errors[values5-values::int4-conversion from `str` to `i32` failed] - polars.exceptions.ComputeError: InvalidOperationError: Conversion from `str` failed.

The easiest thing would be to add this to the list of expected failures. That said, have we ever explored the idea of propagating the original exception back to the user in the case of runtime errors?
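
To make the failure mode concrete, here is a minimal sketch using stand-in exception classes (not the real polars or cudf-polars types). It assumes the wrapper keeps only the original error's type name and message as text, which is what the failing test output above suggests; `gpu_cast` and `run_callback` are hypothetical names introduced for illustration.

```python
class InvalidOperationError(Exception):
    """Stand-in for the error raised on the cudf-polars side."""


class ComputeError(Exception):
    """Stand-in for polars.exceptions.ComputeError."""


def gpu_cast(value: str) -> int:
    """Hypothetical cast that raises the 'right' error for bad input."""
    try:
        return int(value)
    except ValueError as exc:
        raise InvalidOperationError("Conversion from `str` failed.") from exc


def run_callback(value: str) -> int:
    """Hypothetical wrapper: any callback error is re-raised as ComputeError."""
    try:
        return gpu_cast(value)
    except Exception as exc:
        # Assumption: only the type name and message survive, as text.
        raise ComputeError(f"{type(exc).__name__}: {exc}") from None


def caught(value: str):
    """Return the (type, message) a caller would observe for a failing cast."""
    try:
        run_callback(value)
    except Exception as exc:
        return type(exc), str(exc)
    return None, ""
```

Under these assumptions, a test that expects `InvalidOperationError` sees `ComputeError` instead, even though the original error name is still visible in the message.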

wence- · 2024-10-24T11:09:06Z

I've thought about it, but I think it's fiddly because this execution happens inside polars rust, which doesn't know about python exception types, I think.

Can you open an issue/feature in polars to discuss this please?

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

python/cudf_polars/cudf_polars/utils/dtypes.py

wence- · 2024-10-24T11:43:54Z

python/cudf_polars/tests/expressions/test_stringfunction.py

+    a = [
+        1,
+        2,
+        3,
+        4,
+        5,
+        6,
+    ]


Surely this will go on one line if we remove the final trailing comma.

For floating point types, can we please also test negative numbers, +/- inf, nan, and scientific notation?

Argh, libcudf gives us Inf and polars gives us inf. :(

Does polars ingest Inf fine?

If so, just compare case insensitively
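
A case-insensitive comparison of the stringified floats could look like the sketch below. It is plain Python over lists of strings rather than the actual test helper used in this PR; `same_float_strings` is a hypothetical name, and `None` stands in for a null entry.

```python
import math


def same_float_strings(left, right):
    """Compare two lists of float strings, ignoring letter case.

    None entries (nulls) only match other None entries.
    """
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if a is None or b is None:
            if a is not b:
                return False
        elif a.lower() != b.lower():
            return False
    return True


# Spellings like those discussed above: capitalised on one side, lowercase on the other.
gpu_like = ["1.5", "-2.0", "Inf", "-Inf", "NaN", None]
cpu_like = ["1.5", "-2.0", "inf", "-inf", "nan", None]
matches = same_float_strings(gpu_like, cpu_like)

# Python's own float parser accepts either capitalisation.
parsed_inf = float("Inf")
parsed_nan = float("NaN")
```

Lowercasing both sides also treats `1E5` and `1e5` as equal, which seems acceptable for this kind of test.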

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

wence-

This looks good to me, thanks. Can you just check if there are other places where we could now be using astype?

mhaseeb123

C++ changes look good!

vyasr · 2024-10-28T18:36:10Z

Any idea why tests are failing here?

Matt711

I think (from this comment) you still intend to add this as an expected failure here?

python/cudf_polars/cudf_polars/containers/column.py

python/cudf_polars/cudf_polars/testing/plugin.py

python/cudf_polars/cudf_polars/utils/dtypes.py

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

wence- · 2024-10-29T15:15:03Z

python/cudf_polars/cudf_polars/utils/dtypes.py

+    elif (is_integral_not_bool(from_) and is_floating_point(to)) and (
+        plc.types.size_of(to) > plc.types.size_of(from_)
+    ):
+        # Int64 fits in float64, but not in float32
+        return True


I'm not sure about this one, the range of int64 is contained in the range of float32. I believe the mapping preserves the order.

Also, what about float to integral? I suppose it depends on what happens to the out of bounds values.
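
The order claim for int64 to float32 can be checked with a small pure-Python sketch. It rounds through `struct` to emulate float32 and is not the libcudf cast itself; `to_float32` and the sample values are chosen here for illustration only.

```python
import struct

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def to_float32(value: int) -> float:
    """Round an int to a float32 value (returned as a Python float)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


# Sorted int64 samples, including neighbours that float32 cannot tell apart.
samples = sorted(
    [INT64_MIN, -(2**40) - 1, -(2**40), -1, 0, 1,
     2**24, 2**24 + 1, 2**40, INT64_MAX - 1, INT64_MAX]
)
as_f32 = [to_float32(v) for v in samples]

# Rounding is monotone, so a sorted input stays non-decreasing...
is_non_decreasing = all(a <= b for a, b in zip(as_f32, as_f32[1:]))
# ...but distinct integers can collapse to the same float32 value.
has_ties = any(a == b for a, b in zip(as_f32, as_f32[1:]))
```

So the cast loses precision but keeps order in the weak (non-strict) sense, which is what a sortedness flag needs.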

Correct, the float-to-integral cases all fall into the last `return False`, because out-of-range values might lose their ordering in the cast.

Why? If the range is just clamped, then you have no problem, ordering is preserved.

Because I expect the equivalent of this:

>>> ary = np.array([1, 2, float(np.iinfo('int64').max) + 1])
>>> ary
array([1.00000000e+00, 2.00000000e+00, 9.22337204e+18])  # ordered
>>> ary.astype('int64')
array([1, 2, -9223372036854775808])  # not ordered

Is that what cudf does though?

With wrap_numerical=True in the cast for polars, it clamps, AFAICT

no, libcudf clamps. If we're ok encoding libcudf specific implementation behavior into this function we could pass the float to int cases.
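
The difference between the two behaviours discussed here can be sketched in pure Python. Neither function is the libcudf or NumPy implementation: `clamp_to_int64` models saturation at the int64 bounds, `wrap_to_int64` models two's-complement wraparound as one possible non-clamping outcome, and NaN handling is deliberately left out.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def clamp_to_int64(x: float) -> int:
    """Saturating cast: out-of-range values stick to the nearest bound."""
    if x >= 2.0**63:
        return INT64_MAX
    if x <= -(2.0**63):
        return INT64_MIN
    return int(x)


def wrap_to_int64(x: float) -> int:
    """Wrapping cast: out-of-range values wrap modulo 2**64 (finite input only)."""
    return (int(x) + 2**63) % 2**64 - 2**63


def is_non_decreasing(values) -> bool:
    return all(a <= b for a, b in zip(values, values[1:]))


# Same shape of input as the example above: sorted, last value just out of range.
values = [1.0, 2.0, 2.0**63]
clamped = [clamp_to_int64(v) for v in values]
wrapped = [wrap_to_int64(v) for v in values]
```

With clamping the sorted input stays sorted; with wraparound the last element becomes the most negative int64 and the order is lost.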

@wence- is this current behavior considered buggy? It's almost like we should be raising unless wrap_numerical==True.

>>> df.collect()
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ f32       │
╞═══════════╡
│ 1.0       │
│ 2.0       │
│ 9.2234e18 │
└───────────┘
>>> df.select(pl.col('a').cast(pl.Int64)).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/envs/cudf_dev/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2055, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `f32` to `i64` failed in column 'a' for 1 out of 3 values: [9.2234e18]
>>> df.select(pl.col('a').cast(pl.Int64)).collect(engine=pl.GPUEngine())
shape: (3, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 1                   │
│ 2                   │
│ 9223372036854775807 │
└─────────────────────┘

Probably yes, we should barf for the cases where we're in strict-ish mode, because we haven't implemented those.

The issue I foresee here is that wrap_numerical=False, strict=True is the default. This means that by default the GPU backend will also have to scan during the float-int cast for the presence of these values and throw. This is shaping up to be a pattern that occurs in several places within the codebase, and it's probably not ideal to need to scan before every cast.

For now I have passed the float-int conversions through this function, which retains the existing behavior: regardless of whether OOB values are nullified or clamped, we'll retain order. I will raise a separate issue to discuss the proliferation of scanning as a result of polars defaults.
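
The extra scan that strict mode would need can be sketched as follows. This is pure Python over a list, not cudf-polars code; `strict_float_to_int64` is a hypothetical name, and the error text only loosely mirrors the polars message shown earlier.

```python
import math


def strict_float_to_int64(values):
    """Cast floats to int64-range ints, raising if any value cannot be represented.

    The validation pass over all values is the 'scan before every cast'
    cost discussed above.
    """
    bad = [
        v for v in values
        if math.isnan(v) or not (-(2.0**63) <= v < 2.0**63)
    ]
    if bad:
        raise ValueError(
            f"conversion to i64 failed for {len(bad)} out of {len(values)} values: {bad}"
        )
    return [int(v) for v in values]


def cast_fails(values) -> bool:
    """Helper for demonstration: True if the strict cast raises."""
    try:
        strict_float_to_int64(values)
    except ValueError:
        return True
    return False
```

A non-strict path could skip the scan and rely on clamping or nullifying, at the price of diverging from the CPU engine's default behaviour.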

wence- · 2024-11-05T17:57:36Z

Thanks Brandon, happy with the current state.

Can you please write up a bit more detail in #17244 about the different casting modes, what cudf-polars currently does, and the routes to supporting them.

vyasr · 2024-11-07T07:33:02Z

IIUC this PR is ready to merge and @brandon-b-miller just needs to add a bit more info to #17244, right?

brandon-b-miller · 2024-11-07T14:26:02Z

/merge

brandon-b-miller added 4 commits October 14, 2024 06:56

fix

3d0dc8a

passing

59ceb03

more tests, needs refactor

be0fae9

refactor

209b906

brandon-b-miller added feature request New feature or request non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Oct 14, 2024

brandon-b-miller requested a review from a team as a code owner October 14, 2024 16:05

brandon-b-miller requested review from vyasr and Matt711 October 14, 2024 16:05

github-actions bot added the Python Affects Python cuDF API. label Oct 14, 2024

Matt711 reviewed Oct 15, 2024

python/cudf_polars/cudf_polars/dsl/expressions/unary.py

Matt711 and others added 4 commits October 15, 2024 13:19

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

0258478

address reviews

c9199ec

update tests

7feb1f3

handle runtime conversion failure

6d699ac

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

f29a918

wence- requested changes Oct 24, 2024

brandon-b-miller added 3 commits October 24, 2024 13:46

moving things

735c9e3

implement and use is_numeric_not_bool

9f2cc18

update tests

9ecac41

brandon-b-miller requested a review from a team as a code owner October 25, 2024 17:04

brandon-b-miller requested a review from mhaseeb123 October 25, 2024 17:04

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package labels Oct 25, 2024

wence- approved these changes Oct 25, 2024

mhaseeb123 approved these changes Oct 25, 2024

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

740af73

brandon-b-miller added 2 commits October 28, 2024 07:41

test, implement, and use is_order_preserving_cast

cd80083

small fix

f98f635

Matt711 approved these changes Oct 28, 2024

brandon-b-miller added 2 commits October 28, 2024 14:22

pass test_string_from_float

7f75375

add failing test to plugin xfail list

9c9d395

wence- reviewed Oct 29, 2024

python/cudf_polars/cudf_polars/containers/column.py

wence- reviewed Oct 29, 2024

python/cudf_polars/cudf_polars/testing/plugin.py

wence- reviewed Oct 29, 2024

python/cudf_polars/cudf_polars/utils/dtypes.py

brandon-b-miller and others added 2 commits October 29, 2024 09:58

Update python/cudf_polars/cudf_polars/testing/plugin.py

00bd36c

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

Update python/cudf_polars/cudf_polars/utils/dtypes.py

c06f984

Co-authored-by: Lawrence Mitchell <wence@gmx.li>

wence- reviewed Oct 29, 2024

brandon-b-miller added 3 commits October 29, 2024 09:03

minor fixups

a68011a

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

09d8e48

merge/resolve

334eef4

brandon-b-miller mentioned this pull request Nov 4, 2024

[FEA] Support strict=False casting in cudf-polars #17244

Open

brandon-b-miller added 2 commits November 4, 2024 15:59

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

3c45ffb

allow float to int

cf62714

github-actions bot assigned brandon-b-miller Nov 5, 2024

small fixes

0344f53

brandon-b-miller and others added 3 commits November 5, 2024 14:23

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

9cfc487

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

fdd5abc

Merge branch 'branch-24.12' into cudf-polars-string-numeric-casting

69432a1

github-actions bot assigned vyasr Nov 7, 2024

rapids-bot bot merged commit e4c52dd into rapidsai:branch-24.12 Nov 7, 2024
105 checks passed
