-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add type_metadata
property to ColumnBase
#8333
Add type_metadata
property to ColumnBase
#8333
Conversation
if isinstance(self, cudf.core.column.CategoricalColumn): | ||
metadata.update( | ||
{"categories": self.categories, "ordered": self.ordered} | ||
) | ||
if isinstance(self, cudf.core.column.StructColumn): | ||
metadata["field_keys"] = self.dtype.fields.keys() | ||
if isinstance(self, cudf.core.column.DecimalColumn): | ||
metadata["precision"] = self.dtype.precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Structs / Decimals the info is already attached to the dtype here. For Categoricals I believe the categories are attached to the dtype as well, where we should always have the info to reconstruct a column given its dtype, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, but wouldn't we need access to self
to do that?
The main function of these changes would be that we could make a copy of a column that is not in scope - for example, when unpacking a PackedColumns
object from #8153, we need to apply the correct dtype metadata to each resulting column, but we probably don't want to have to access the original DataFrame (before packing) to do this.
With these changes, we could store each columns' type_metadata
in the PackedColumns
object, and use that for reconstruction instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my point is why do we need a new function associated with the Column class to do this when all of the information already exists on the dtype? I.E. I believe we already have __serialize__
and __deserialize__
implemented for all of our dtypes, so we should already be able to do serialize(column.dtype)
to shove into the PackedColumns
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense - in that case, I think a good roadmap would be to add the serialize/deserialize methods to the StructDtype
so that all the types affected here are serializable, and then implement the relevant as_*_column()
methods so that we can cast the columns with the dtype alone.
EDIT: just realized we shouldn't need any casting for the struct/decimal columns - we only need to apply the metadata. Would we want a utility function to apply metadata from a Dtype
for this, or is it suitable to just do this conditionally in unpack()
?
Closing this in favor of just including the columns' dtypes in the |
I think the logic will end up being very similar to |
Agreed - I think that a general function that is able to apply metadata to a column taking a dtype as input could work for all these cases, although I'm not sure how to handling typing if we want it to work for Arrow arrays. Would fleshing out the casting functions here be helpful for this purpose? cudf/python/cudf/cudf/core/column/column.py Lines 965 to 993 in dd5eecd
|
I think all three cases could be implemented in terms of something like class ColumnBase:
def _apply_type_metadata(self, dtype: Dtype) -> ColumnBase:
# return a new column composed of the data from `self`
# and metadata from `dtype`
pass
def _copy_type_metadata(self, other: ColumnBase) -> ColumnBase:
# copy the type metadata from the column `other` onto `self`
return self._apply_type_metadata(other.dtype)
def _copy_type_metadata_from_arrow(self, other: pa.Array) -> ColumnBase:
# copy the type metadata from the pa.Array `other` onto `self`
return self._apply_type_metadata(cudf_dtype_from_pa_type(other.type)) |
I'm not necessarily advocating that we need three distinct methods though. I think just having the first one is enough. |
The difficult thing of reusing shared recursive backbone is that both cudf and pyarrow have nested column structure, where each column has there own |
Right, I imagine that |
I could be wrong. For concrete example, say we have a column of list of list of ints: >>> x = cudf.Series([[[1, 2, 3]]]) It is true that >>> x.dtype
ListDtype(ListDtype(int64)) However, >>> s._column.children[1].dtype
ListDtype(int64) This is best illustrated if the nested type is not only a For pyarrow it's pretty much the same. >>> x = pa.array([[[1, 2, 3]]])
>>> x.type
ListType(list<item: list<item: int64>>)
>>> x.values.type
ListType(list<item: int64>) |
Based on discussion on #8333: - adds `_with_type_metadata()` to `ColumnBase` to return a new column with the metadata of `dtype` applied - removes `_copy_type_metadata[_from_arrow]()` and uses this function in their place These changes would be helpful for #8153, as we want to be able to copy metadata from one column to another using only the dtype object. Authors: - Charles Blackmon-Luca (https://github.com/charlesbluca) Approvers: - Ashwin Srinath (https://github.com/shwina) - Michael Wang (https://github.com/isVoid) URL: #8373
Adds a
type_metadata
property toColumnBase
, which returns a dictionary with the dtype of the column along with metadata required to construct a copy of this column (if any).Also adds method
_copy_type_metadata_from_dict()
to allow aColumnBase
to apply a type metadata dictionary to itself, and refactors_copy_type_metadata()
to use this method/property when copying metadata from one column to another.The motivator here is #8153, which requires us to construct copies of columns with dtype metadata without actually having access to the original columns.