-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Implement nested types: Add StructDtype and StructArray #45745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @Hoeze! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
|
ExtensionArray, | ||
) | ||
|
||
# from pandas.core.construction import extract_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls remove commented-out code wherever possible. where not possible, pls comment as to why
|
||
import logging | ||
|
||
log = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually not, I'll remove it
"fields that this StructArray holds" | ||
# _field_names: List[str] | ||
# "ordered list of field names that this StructArray holds" | ||
_mask: Union[NoneType, np.ndarray] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mask: np.ndarray | None
, then can remove NoneType
im assuming you mean ExtensionArray, since PandasArray is a specific subclass. Use
It is a bit annoying that infer_dtype doesn't return a dtype or dtype-like-string. You might be better off with
core.dtypes.concat.concat_compat
I'm working on making MaskedArray a wrapper that can be used around arbitrary other arrays, but don't hold your breath. |
It might make sense just to have a ArrowStructArray directly and use the pyarrow.StructType as the backing dtype? |
yeah I agree, we are moving towards backing with arrow types rather than having new types 'roll their own'. This is the scalable longer term soln. There is already a ArrowBackedExtensionArray subclass that you can start from, and @mroeschke is doing some work on this to extend to numeric operations. I would be +1 on an implementation (even if limitted) in this way. |
Another thing I conclude from pola-rs/polars#3007:
I will continue working on this PR as far as my free time allows. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. (PS I have been working on integrating arrow into pandas which can hopefully use arrows StructDtype and scalars, but thanks for the effort on this so far!) |
This is my first attempt to implement a StructDtype that is based on NamedTuples.
My end-goal is to reach full interoperability with PyArrow/PySpark and deeply nested types.
In particular, this StructDtype and StructArray should make it much simpler to create custom data types.
Some implementation details:
StructArray
consists of anOrderedDict
("field_name" -> "pandas_type") and a mask if it is nullable.StructDtype
is accessible viadtype.type
and can be overridden with custom data types.StructDtype
can be arbitrarily nested. (I need to test this later.)There are a number of open issues to resolve, but I'd be very happy about some profound review or tips for improving.
pd.api.types.pandas_dtype(pd.api.types.infer_dtype(v))
not work for e.g. integers?NamedTuple
as scalar type? This makes it impossible to have e.g. "0" as field name.pa.array().to_pandas()
for anypa.StructType
?doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.