Implement nested types: Add StructDtype and StructArray #45745

Closed
wants to merge 1 commit into from

Conversation

@Hoeze Hoeze commented Jan 31, 2022

This is my first attempt at implementing a StructDtype based on NamedTuples.
My end goal is full interoperability with PyArrow/PySpark and support for deeply nested types.
In particular, this StructDtype and StructArray should make it much simpler to create custom data types.

Some implementation details:

  • A StructArray consists of an OrderedDict ("field_name" -> "pandas_type") and a mask if it is nullable.
  • The scalar type of the StructDtype is accessible via dtype.type and can be overridden with custom data types.
  • In theory, all functions should be recursive so that a StructDtype can be nested arbitrarily. (I still need to test this.)
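The layout described in the list above can be sketched in plain Python. This is only an illustration, not the code from the PR: `ToyStructArray` is a hypothetical name, plain lists stand in for pandas arrays, and the convention that `True` in the mask marks a missing row is an assumption borrowed from pandas' masked arrays.

```python
from collections import OrderedDict, namedtuple


class ToyStructArray:
    """Hypothetical stand-in: one column per field plus an optional mask."""

    def __init__(self, fields, mask=None):
        # Ordered mapping "field_name" -> column (a list or a nested ToyStructArray).
        self.fields = OrderedDict(fields)
        if len({len(col) for col in self.fields.values()}) > 1:
            raise ValueError("all fields must have the same length")
        # Assumption: True marks a missing row; None means "not nullable".
        self.mask = mask
        # The scalar type is a NamedTuple built from the field names.
        self._row_type = namedtuple("Row", list(self.fields))

    def __len__(self):
        return len(next(iter(self.fields.values()), []))

    def __getitem__(self, i):
        if self.mask is not None and self.mask[i]:
            return None
        # Nested struct columns return their own Row, so nesting is recursive.
        return self._row_type(*(col[i] for col in self.fields.values()))


points = ToyStructArray({"x": [1, 2], "y": [3, 4]})
arr = ToyStructArray({"id": [10, 11], "pt": points}, mask=[False, True])
first = arr[0]   # a Row whose "pt" field is itself a Row
second = arr[1]  # masked out, so None
```

Because a nested column only needs `__len__` and `__getitem__`, the same class can serve as a field of another struct, which is the recursive property the third bullet hopes for.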

There are a number of open issues to resolve, but I would appreciate a thorough review or any tips for improvement.

  • I kept the dtype and array classes in a single python file to make reviews simpler for now. These will need to be split up into the correct files later.
  • The lines https://github.com/Hoeze/pandas/blob/d651395143cc4cadbb7d18fae2f98ef2354710ab/pandas/core/arrays/struct.py#L42-L88 are very ugly hacks for operations for which I could not find an existing function:
    • How to allocate empty Pandas arrays?
    • Why does pd.api.types.pandas_dtype(pd.api.types.infer_dtype(v)) not work for e.g. integers?
    • Why is there no standard way to concatenate Pandas arrays?
  • How bad is it to have NamedTuple as scalar type? This makes it impossible to have e.g. "0" as field name.
  • How would I overwrite pa.array().to_pandas() for any pa.StructType?
  • Where should nullability be handled? In PyArrow, a dtype itself is not nullable, but an array or StructField can be. In pandas, it seems like nullability needs to be reimplemented in every array class.
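The NamedTuple limitation raised in the list above can be reproduced with the standard library alone. The snippet below is a small demonstration using `collections.namedtuple`; the type name `Row` is made up for the example.

```python
from collections import namedtuple

# Field names must be valid Python identifiers, so "0" is rejected.
try:
    namedtuple("Row", ["0"])
except ValueError as err:
    error = str(err)
else:
    error = None

# rename=True avoids the error, but silently replaces the name with a
# positional one, so the original column name is lost.
renamed = namedtuple("Row", ["0"], rename=True)
renamed_fields = renamed._fields
```

Neither outcome preserves a column called "0", which is why a NamedTuple scalar cannot represent arbitrary field names.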

@pep8speaks

Hello @Hoeze! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 6:89: E501 line too long (110 > 88 characters)
Line 42:1: E302 expected 2 blank lines, found 1
Line 107:89: E501 line too long (97 > 88 characters)
Line 304:89: E501 line too long (95 > 88 characters)
Line 320:89: E501 line too long (108 > 88 characters)
Line 334:89: E501 line too long (102 > 88 characters)
Line 344:89: E501 line too long (102 > 88 characters)
Line 349:89: E501 line too long (94 > 88 characters)
Line 412:89: E501 line too long (98 > 88 characters)
Line 417:89: E501 line too long (91 > 88 characters)
Line 444:89: E501 line too long (98 > 88 characters)
Line 457:89: E501 line too long (101 > 88 characters)
Line 462:89: E501 line too long (96 > 88 characters)
Line 513:17: E126 continuation line over-indented for hanging indent
Line 518:89: E501 line too long (101 > 88 characters)
Line 542:89: E501 line too long (113 > 88 characters)
Line 552:89: E501 line too long (95 > 88 characters)
Line 637:89: E501 line too long (108 > 88 characters)
Line 649:89: E501 line too long (92 > 88 characters)
Line 654:89: E501 line too long (108 > 88 characters)
Line 660:89: E501 line too long (98 > 88 characters)
Line 680:89: E501 line too long (117 > 88 characters)
Line 734:89: E501 line too long (99 > 88 characters)
Line 737:89: E501 line too long (106 > 88 characters)
Line 805:89: E501 line too long (95 > 88 characters)
Line 816:18: E711 comparison to None should be 'if cond is not None:'
Line 918:89: E501 line too long (92 > 88 characters)
Line 946:89: E501 line too long (118 > 88 characters)
Line 996:89: E501 line too long (99 > 88 characters)
Line 997:89: E501 line too long (89 > 88 characters)
Line 1001:89: E501 line too long (101 > 88 characters)
Line 1011:89: E501 line too long (89 > 88 characters)
Line 1030:25: E117 over-indented
Line 1117:1: W391 blank line at end of file

Line 56:89: E501 line too long (104 > 88 characters)

@Hoeze Hoeze mentioned this pull request Jan 31, 2022
@Hoeze Hoeze changed the title from "initial commit for StructDtype and StructArray" to "Implement nested types: Add StructDtype and StructArray" Jan 31, 2022
ExtensionArray,
)

# from pandas.core.construction import extract_array
Member

Please remove commented-out code wherever possible. Where that is not possible, please add a comment explaining why.


import logging

log = logging.getLogger(__name__)
Member

Is this used?

Author

Actually no, I'll remove it.

"fields that this StructArray holds"
# _field_names: List[str]
# "ordered list of field names that this StructArray holds"
_mask: Union[NoneType, np.ndarray]
Member

Use _mask: np.ndarray | None; then NoneType can be removed.

@jbrockmendel
Member

How to allocate empty Pandas arrays?

I'm assuming you mean ExtensionArray, since PandasArray is a specific subclass. Use dtype.empty(shape).

Why does pd.api.types.pandas_dtype(pd.api.types.infer_dtype(v)) not work for e.g. integers?

It is a bit annoying that infer_dtype doesn't return a dtype or a dtype-like string. You might be better off with infer_dtype_from(v).

Why is there no standard way to concatenate Pandas arrays?

core.dtypes.concat.concat_compat

In pandas, it seems like nullability needs to be implemented in every array class again.

I'm working on making MaskedArray a wrapper that can be used around arbitrary other arrays, but don't hold your breath.
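Two of the answers above can be illustrated with a short sketch. It assumes pandas 1.4 or later is installed (needed for `ExtensionDtype.empty`), and it uses `Int64Dtype` purely as an example dtype; the internal helpers `infer_dtype_from` and `concat_compat` are deliberately left out because they are not public API.

```python
import pandas as pd

# infer_dtype returns a descriptive label, not a dtype string.
kind = pd.api.types.infer_dtype([1, 2, 3])  # "integer"

# That label is not something pandas_dtype can parse, hence the failure
# described in the original question.
try:
    pd.api.types.pandas_dtype(kind)
    accepted = True
except TypeError:
    accepted = False

# Allocating an empty ExtensionArray through its dtype, as suggested.
# The contents are unspecified; only the length and dtype are meaningful.
empty = pd.Int64Dtype().empty((3,))
```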

@mroeschke
Member

It might make sense to just have an ArrowStructArray directly and use pyarrow.StructType as the backing dtype?

@jreback
Contributor

jreback commented Mar 6, 2022

It might make sense to just have an ArrowStructArray directly and use pyarrow.StructType as the backing dtype?

Yeah, I agree: we are moving towards backing with Arrow types rather than having new types 'roll their own'. This is the scalable longer-term solution. There is already an ArrowBackedExtensionArray subclass that you can start from, and @mroeschke is doing some work to extend it to numeric operations.

I would be +1 on an implementation done this way, even if it is limited.

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Mar 6, 2022
@Hoeze
Author

Hoeze commented Apr 3, 2022

Another conclusion I draw from pola-rs/polars#3007:
We should not use NamedTuple as the base class, but rather subclass tuple with a custom __getitem__():

  • tuple[idx] where idx is an integer -> positional indexing of the field
  • tuple[col] where col is a string -> get field by name
    This way, we can keep support for arbitrary column names.
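A minimal sketch of that idea follows. `StructScalar` and its constructor signature are hypothetical names chosen for illustration; nothing here is taken from the PR or from polars.

```python
class StructScalar(tuple):
    """Hypothetical tuple subclass that supports lookup by position or by name."""

    def __new__(cls, names, values):
        names = tuple(names)
        values = tuple(values)
        if len(names) != len(values):
            raise ValueError("names and values must have the same length")
        self = super().__new__(cls, values)
        # Names are stored as-is, so they do not have to be valid identifiers.
        self._names = names
        return self

    def __getitem__(self, key):
        if isinstance(key, str):
            # String key: look the field up by name.
            try:
                return super().__getitem__(self._names.index(key))
            except ValueError:
                raise KeyError(key) from None
        # Integer (or slice) key: ordinary positional tuple indexing.
        return super().__getitem__(key)


row = StructScalar(["0", "my col"], [1.5, "a"])
by_position = row[0]
by_name = row["0"]
spaced = row["my col"]
```

Since string keys and integer keys are dispatched by type, a field named "0" and position 0 never collide, and the object still behaves like an ordinary tuple for iteration and unpacking.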

I will continue working on this PR as far as my free time allows.

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If you are interested in continuing, please merge in the main branch and address any review comments and/or failing tests, and we can reopen. (PS: I have been working on integrating Arrow into pandas, which can hopefully use Arrow's StructDtype and scalars, but thanks for the effort on this so far!)

@mroeschke mroeschke closed this Jun 10, 2022
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Development

Successfully merging this pull request may close these issues.

ENH: Pandas StructDType
5 participants