-
Notifications
You must be signed in to change notification settings - Fork 918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate column factories to pylibcudf #15257
Migrate column factories to pylibcudf #15257
Conversation
FYI @brandon-b-miller we should address the TODO I noted in this commit in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structure looks good overall two things:
- It feels like the fused types for empty column construction can be simpler.
- It would be great if the datatype conversions didn't rely on quite so many third-party libraries.
elif isinstance(pyarrow_object, pa.ListType): | ||
return DataType(type_id.LIST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: The pylibcudf DataType
object doesn't have a concept of nested lists, but the pyarrow datatype does: for pyarrow, the datatype is List(element_type)
, whereas for pylibcudf it's List
. Does that potentially cause problems here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the "nestedness" is a property of the column and not as much a property of the type, at least from libcudf's perspective.
For the from_arrow
case, there's nothing to return except for DataType(type_id.LIST)
. For to_arrow
, we have the column to inspect for the non dtype cases, but not for the dtype only case. Since we can't figure out exactly what kind of pa.List
or pa.Struct
to return, I chose to error there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There simply isn't a 1-1 mapping between pyarrow and pylibcudf data types for nested types because in pylibcudf (read: libcudf) the nested types are encoded in the column as Brandon pointed out. So I don't think there's much else we can do in the from_arrow
case.
For the to_arrow
case, maybe we should accept a kwarg column
that can be inspected in the case of list/struct types. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Walrus (walri?) abound in my attempt here bffe500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return DataType( | ||
SUPPORTED_NUMPY_TO_LIBCUDF_TYPES.get( | ||
np.dtype(pyarrow_object.to_pandas_dtype())) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: So conversion, much wow.
We would like to get to a state where pylibcudf doesn't depend on arrow, numpy, and, in this case, transitively pandas. Can we not introduce that dependency here by explicitly enumerating the handling of all types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely shouldn't have a pandas dependence in pylibcudf. pyarrow dependence should be isolated to the interop module, which should only be loaded conditionally. numpy dependence I was less worried about, but from a quick search the only usage I currently see in pylibcudf is here, so it seems like we might be able to strip that out fairly painlessly too.
This PR adds an extra np.dtype call in column.pyx in place of enumerating. I'm in the same boat as Lawrence and think we should avoid using numpy if it's easy to enumerate. I don't know how painful that will be to maintain for all dtypes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b3e934c reverts the earlier stab at things and adds two explicit mappings.
…s.pxd Co-authored-by: Lawrence Mitchell <wence@gmx.li>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small things, but generally looks good now. Thanks!
elif isinstance(pyarrow_object, pa.ListType): | ||
return DataType(type_id.LIST) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny niggles, looks great.
/merge |
This PR implements
column_factories.hpp
usingpylibcudf
and migrates the cuDF cython to use them cc @vyasr