Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cudf-polars string/numeric casting #17076

Conversation

brandon-b-miller
Copy link
Contributor

Depends on #16991
Part of #17060

Implements cross casting from string <-> numeric types in cudf-polars

@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Oct 14, 2024
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner October 14, 2024 16:05
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 14, 2024
Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some questions. And I think the PR needs the latests changes from branch-24.12.

python/cudf_polars/cudf_polars/dsl/expressions/unary.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/expressions/unary.py Outdated Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

@wence- in this case polars is testing for a specific exception when a string can't be cast to an integer. We're able to raise an exception in the right place on the cudf-polars side, but because of the way we wrap it in a polars.exceptions.ComputeError we assert False in the end.

FAILED py-polars/tests/unit/sql/test_cast.py::test_cast_errors[values5-values::int4-conversion from `str` to `i32` failed] - polars.exceptions.ComputeError: InvalidOperationError: Conversion from `str` failed.

The easiest thing would be to add this to the list of expected failures. That said, have we ever explored the idea of propagating the original exception back to the user in the case of runtime errors?

@wence-
Copy link
Contributor

wence- commented Oct 24, 2024

@wence- in this case polars is testing for a specific exception when a string can't be cast to an integer. We're able to raise an exception in the right place on the cudf-polars side, but because of the way we wrap it in a polars.exceptions.ComputeError we assert False in the end.

FAILED py-polars/tests/unit/sql/test_cast.py::test_cast_errors[values5-values::int4-conversion from `str` to `i32` failed] - polars.exceptions.ComputeError: InvalidOperationError: Conversion from `str` failed.

The easiest thing would be to add this to the list of expected failures. That said, have we ever explored the idea of propagating the original exception back to the user in the case of runtime errors?

I've thought about it, but I think it's fiddly because this execution happens inside polars rust, which doesn't know about python exception types, I think.

Can you open an issue/feature in polars to discuss this please?

python/cudf_polars/cudf_polars/dsl/expressions/unary.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/utils/dtypes.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/utils/dtypes.py Outdated Show resolved Hide resolved
Comment on lines 58 to 65
a = [
1,
2,
3,
4,
5,
6,
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely this will go on one line if we remove the final trailing comma.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For floating point types, can we please test:

  • negative numbers
  • +/- inf
  • nan
  • scientific notation

In addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, libcudf gives us Inf and polars gives us inf. :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does polars ingest Inf fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, just compare case insensitively

python/cudf_polars/cudf_polars/dsl/expressions/unary.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/expressions/unary.py Outdated Show resolved Hide resolved
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner October 25, 2024 17:04
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package labels Oct 25, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks. Can you just check if there are other places where we could now be using astype?

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ changes look good!

@vyasr
Copy link
Contributor

vyasr commented Oct 28, 2024

Any idea why tests are failing here?

Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (from this comment) you still intend to add this as an expected failure here?

brandon-b-miller and others added 2 commits October 29, 2024 09:58
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Comment on lines 117 to 121
elif (is_integral_not_bool(from_) and is_floating_point(to)) and (
plc.types.size_of(to) > plc.types.size_of(from_)
):
# Int64 fits in float64, but not in float32
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this one, the range of int64 is contained in the range of float32. I believe the mapping preserves the order.

Also, what about float to integral? I suppose it depends on what happens to the out of bounds values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct the float to integral cases all fall into the last return False because the out of range values might lose their ordering in the cast.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? If the range is just clamped, then you have no problem, ordering is preserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I expect the equivalent of this:

>>> ary = np.array([1, 2, float(np.iinfo('int64').max) + 1])
>>> ary
array([1.00000000e+00, 2.00000000e+00, 9.22337204e+18] # ordered
>>> ary.astype('int64')
array([1, 2, -9223372036854775808]) # not ordered

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that what cudf does though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With wrap_numerical=True in the cast for polars, it clamps, AFAICT

Copy link
Contributor Author

@brandon-b-miller brandon-b-miller Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that what cudf does though?

no, libcudf clamps. If we're ok encoding libcudf specific implementation behavior into this function we could pass the float to int cases.

Copy link
Contributor Author

@brandon-b-miller brandon-b-miller Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- is this current behavior considered buggy? It's almost like we should be raising unless wrap_numeric==True.

>>> df.collect()
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ f32       │
╞═══════════╡
│ 1.0       │
│ 2.0       │
│ 9.2234e18 │
└───────────┘
>>> df.select(pl.col('a').cast(pl.Int64)).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/envs/cudf_dev/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2055, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `f32` to `i64` failed in column 'a' for 1 out of 3 values: [9.2234e18]
>>> df.select(pl.col('a').cast(pl.Int64)).collect(engine=pl.GPUEngine())
shape: (3, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 1                   │
│ 2                   │
│ 9223372036854775807 │
└─────────────────────┘

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes, we should barf for the cases where we're strictish mode because we haven't implemented those

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I forsee here is that wrap_numerical=False, strict=True is the default. This means that by default the GPU backend will also have to scan during the float-int cast for the presence of these values and throw. This is shaping up to be a pattern that occurs in several places within the codebase, and it's probably not ideal to need to scan before every cast.

For now I have passed the float-int conversions through this function which retains the existing behavior, since regardless of if OOB values are nullified or clamped we'll retain order. I will raise a separate issue to discuss the proliferation of scanning as a result of polars defaults.

@wence-
Copy link
Contributor

wence- commented Nov 5, 2024

Thanks Brandon, happy with the current state.

Can you please write up a bit more detail in #17244 about the different casting modes, what cudf-polars currently does, and the routes to supporting them.

@vyasr
Copy link
Contributor

vyasr commented Nov 7, 2024

IIUC this PR is ready to merge and @brandon-b-miller just needs to add a bit more info to #17244, right?

@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e4c52dd into rapidsai:branch-24.12 Nov 7, 2024
105 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants