cudf-polars
string/numeric casting
#17076
I left some questions. I also think the PR needs the latest changes from branch-24.12.
@wence- in this case polars is testing for a specific exception when a string can't be cast to an integer. We're able to raise an exception in the right place on the
The easiest thing would be to add this to the list of expected failures. That said, have we ever explored the idea of propagating the original exception back to the user in the case of runtime errors?
I've thought about it, but I think it's fiddly because this execution happens inside polars' Rust code, which doesn't know about Python exception types, I think. Can you open an issue/feature request in polars to discuss this please?
a = [
    1,
    2,
    3,
    4,
    5,
    6,
]
Surely this will go on one line if we remove the final trailing comma.
For floating point types, can we please also test:
- negative numbers
- +/- inf
- nan
- scientific notation
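The cases requested above could be collected into test data along these lines. This is only a sketch: it uses Python's built-in `float` as a stand-in reference parser, and the name `FLOAT_STRINGS` is hypothetical. The real tests would presumably run the cast through polars and the GPU engine instead.

```python
import math

# Hypothetical test inputs covering the requested float cases:
# negative numbers, +/- inf, nan, and scientific notation.
FLOAT_STRINGS = ["-1.5", "inf", "-inf", "nan", "1.5e3", "-2.5E-2"]

# Built-in float() is only a stand-in for the string -> float cast under test.
parsed = [float(s) for s in FLOAT_STRINGS]

# NaN never compares equal to itself, so classify it explicitly.
for s, value in zip(FLOAT_STRINGS, parsed):
    if math.isnan(value):
        kind = "nan"
    elif math.isinf(value):
        kind = "inf"
    else:
        kind = "finite"
    print(f"{s!r} -> {value!r} ({kind})")
```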
Argh, libcudf gives us `Inf` and polars gives us `inf`. :(
Does polars ingest `Inf` fine?
If so, just compare case insensitively
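A case-insensitive comparison of the stringified results might look like the sketch below. The helper name `same_ignoring_case` and the sample values are made up for illustration; only the `Inf` versus `inf` spelling difference comes from the discussion above.

```python
def same_ignoring_case(left, right):
    """Compare two lists of optional strings, ignoring letter case."""
    if len(left) != len(right):
        return False

    def norm(s):
        # Keep nulls as None; fold case for everything else.
        return None if s is None else s.casefold()

    return [norm(s) for s in left] == [norm(s) for s in right]


# Illustrative values only, not captured output from either engine.
gpu_result = ["1.5", "Inf", "-Inf", None]
cpu_result = ["1.5", "inf", "-inf", None]

print(gpu_result == cpu_result)                    # exact comparison fails
print(same_ignoring_case(gpu_result, cpu_result))  # folded comparison passes
```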
This looks good to me, thanks. Can you just check if there are other places where we could now be using `astype`?
C++ changes look good!
Any idea why tests are failing here?
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
elif (is_integral_not_bool(from_) and is_floating_point(to)) and (
    plc.types.size_of(to) > plc.types.size_of(from_)
):
    # Int64 fits in float64, but not in float32
    return True
I'm not sure about this one, the range of int64 is contained in the range of float32. I believe the mapping preserves the order.
Also, what about float to integral? I suppose it depends on what happens to the out of bounds values.
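The claim about int64 to float32 can be checked with a small pure-Python model. This is a sketch under stated assumptions: `to_float32` is a hypothetical helper that rounds through the `struct` module (int to float64 to float32) and says nothing about how libcudf performs the cast. It illustrates that the mapping never reverses order, although distinct integers can collapse to the same float32.

```python
import struct


def to_float32(value):
    """Round a Python number to the nearest float32, returned as a float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

# Sorted int64 inputs, including values beyond float32's 24-bit precision.
ints = [INT64_MIN, -(2**24) - 1, -1, 0, 1, 2**24, 2**24 + 1, 2**40, INT64_MAX]
floats = [to_float32(i) for i in ints]

# Order is preserved in the weak sense: the sequence never decreases.
is_sorted = all(a <= b for a, b in zip(floats, floats[1:]))

# Precision is lost, though: neighbouring integers can become equal.
collapsed = to_float32(2**24) == to_float32(2**24 + 1)

print(is_sorted, collapsed)
```

So a sorted int64 column stays sorted after the cast, but a strictly increasing one may no longer be strictly increasing.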
Correct, the float to integral cases all fall into the last `return False` because the out-of-range values might lose their ordering in the cast.
Why? If the range is just clamped, then you have no problem, ordering is preserved.
Because I expect the equivalent of this:
>>> import numpy as np
>>> ary = np.array([1, 2, float(np.iinfo('int64').max) + 1])
>>> ary
array([1.00000000e+00, 2.00000000e+00, 9.22337204e+18])  # ordered
>>> ary.astype('int64')
array([1, 2, -9223372036854775808])  # not ordered
Is that what cudf does though?
With `wrap_numerical=True` in the cast for polars, it clamps, AFAICT.
> Is that what cudf does though?
No, libcudf clamps. If we're OK with encoding libcudf-specific implementation behavior into this function, we could pass the float-to-int cases.
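The difference between the two out-of-range behaviors discussed here can be modelled in plain Python. This is a sketch, not libcudf's or NumPy's implementation: `clamp_to_int64` and `wrap_to_int64` are hypothetical helpers, and the wrapping model (truncate, then reduce modulo 2**64) is an assumption chosen to reproduce the NumPy output shown earlier.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def clamp_to_int64(x):
    """Saturating cast: out-of-range floats stick to the nearest int64 bound."""
    if x >= 2.0**63:
        return INT64_MAX
    if x <= -(2.0**63):
        return INT64_MIN
    return int(x)  # truncates toward zero; NaN is not handled in this sketch


def wrap_to_int64(x):
    """Wrapping cast: truncate, then reduce modulo 2**64 (two's complement).

    Only finite inputs are supported in this sketch.
    """
    return (int(x) + 2**63) % 2**64 - 2**63


values = [1.0, 2.0, float(2**63)]  # sorted input, last value out of range
clamped = [clamp_to_int64(v) for v in values]
wrapped = [wrap_to_int64(v) for v in values]

print(clamped)  # stays sorted
print(wrapped)  # last element wraps to the most negative int64
```

Under the clamping model a sorted float column stays sorted after the cast, which is the property the function under review cares about.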
@wence- is this current behavior considered buggy? It's almost like we should be raising unless `wrap_numerical=True`.
>>> df.collect()
shape: (3, 1)
┌───────────┐
│ a         │
│ ---       │
│ f32       │
╞═══════════╡
│ 1.0       │
│ 2.0       │
│ 9.2234e18 │
└───────────┘
>>> df.select(pl.col('a').cast(pl.Int64)).collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/envs/cudf_dev/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2055, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `f32` to `i64` failed in column 'a' for 1 out of 3 values: [9.2234e18]
>>> df.select(pl.col('a').cast(pl.Int64)).collect(engine=pl.GPUEngine())
shape: (3, 1)
┌─────────────────────┐
│ a                   │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 1                   │
│ 2                   │
│ 9223372036854775807 │
└─────────────────────┘
Probably yes, we should barf for the cases where we're in strict-ish mode, because we haven't implemented those.
The issue I foresee here is that `wrap_numerical=False`, `strict=True` is the default. This means that by default the GPU backend will also have to scan during the float-to-int cast for the presence of these values and throw. This is shaping up to be a pattern that occurs in several places within the codebase, and it's probably not ideal to need to scan before every cast.
For now I have passed the float-to-int conversions through this function, which retains the existing behavior: regardless of whether OOB values are nullified or clamped, we'll retain order. I will raise a separate issue to discuss the proliferation of scanning as a result of polars defaults.
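To make the cost concrete, a strict cast has to look at every value before converting any of them. The sketch below models that in plain Python; `strict_cast_to_int64` is a hypothetical helper, and its error message is only loosely modelled on the polars error quoted earlier. On the GPU the scan would presumably be an extra pass over the column, which is the overhead being discussed.

```python
import math


def strict_cast_to_int64(values):
    """Scan for values that do not fit in int64 and raise before casting."""
    # First pass: find everything that cannot be represented as int64.
    bad = [
        v for v in values
        if math.isnan(v) or v >= 2.0**63 or v < -(2.0**63)
    ]
    if bad:
        # Hypothetical message, not the exact polars wording.
        raise ValueError(
            f"conversion to int64 failed for {len(bad)} "
            f"out of {len(values)} values: {bad}"
        )
    # Second pass: the cast itself, now known to be safe.
    return [int(v) for v in values]


print(strict_cast_to_int64([1.0, 2.0]))

try:
    strict_cast_to_int64([1.0, 2.0, float(2**63)])
except ValueError as exc:
    print(f"raised: {exc}")
```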
Thanks Brandon, happy with the current state. Can you please write up a bit more detail in #17244 about the different casting modes, what cudf-polars currently does, and the routes to supporting them?
IIUC this PR is ready to merge and @brandon-b-miller just needs to add a bit more info to #17244, right?
/merge
Depends on #16991
Part of #17060
Implements cross casting from string <-> numeric types in cudf-polars.