Implement nested types: Add StructDtype and StructArray #45745

Hoeze · 2022-01-31T21:57:09Z

This is my first attempt to implement a StructDtype that is based on NamedTuples.
My end-goal is to reach full interoperability with PyArrow/PySpark and deeply nested types.
In particular, this StructDtype and StructArray should make it much simpler to create custom data types.

Some implementation details:

A StructArray consists of an OrderedDict ("field_name" -> "pandas_type") and a mask if it is nullable.
The scalar type of the StructDtype is accessible via dtype.type and can be overridden with custom data types.
In theory, all functions should be recursive so that a StructDtype can be arbitrarily nested. (I still need to test this.)
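
A minimal sketch of the layout described above, written in pure Python rather than against the actual pandas ExtensionArray interface. The class name `ToyStructArray`, the use of plain lists as columns, and the convention that `True` in the mask means "missing" are assumptions made for illustration; they are not taken from the PR's code.

```python
from collections import OrderedDict, namedtuple


class ToyStructArray:
    """Toy struct-of-arrays container: one column per field plus an optional mask.

    Hypothetical illustration only; not the StructArray from this PR.
    """

    def __init__(self, fields, mask=None):
        # Ordered mapping "field_name" -> column of values.
        self._fields = OrderedDict(fields)
        lengths = {len(col) for col in self._fields.values()}
        if len(lengths) > 1:
            raise ValueError("all field columns must have the same length")
        self._length = lengths.pop() if lengths else 0
        if mask is not None and len(mask) != self._length:
            raise ValueError("mask must have the same length as the columns")
        # Assumed convention: True marks a missing struct; None means not nullable.
        self._mask = list(mask) if mask is not None else None
        # Scalar type is a namedtuple, so field names must be valid identifiers.
        self._scalar_type = namedtuple("Row", list(self._fields))

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if self._mask is not None and self._mask[i]:
            return None
        return self._scalar_type(*(col[i] for col in self._fields.values()))


arr = ToyStructArray(
    {"x": [1, 2, 3], "y": ["a", "b", "c"]},
    mask=[False, True, False],
)
first = arr[0]   # a Row namedtuple built from the first element of each column
second = arr[1]  # masked out, so None
```

Nesting would follow by allowing a column to be another `ToyStructArray`, which is the recursion the description refers to.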

There are a number of open issues to resolve, but I'd be very happy to get a thorough review or tips for improvement.

I kept the dtype and array classes in a single python file to make reviews simpler for now. These will need to be split up into the correct files later.
The lines https://github.com/Hoeze/pandas/blob/d651395143cc4cadbb7d18fae2f98ef2354710ab/pandas/core/arrays/struct.py#L42-L88 are very ugly hacks for some methods where I could not find the corresponding Python functions:
- How to allocate empty Pandas arrays?
- Why does pd.api.types.pandas_dtype(pd.api.types.infer_dtype(v)) not work for e.g. integers?
- Why is there no standard way to concatenate Pandas arrays?

Other open questions:
- How bad is it to have NamedTuple as the scalar type? This makes it impossible to have e.g. "0" as a field name.
- How would I override pa.array().to_pandas() for any pa.StructType?
- Where should nullability be handled? In PyArrow, a dtype itself is not nullable, but an array or StructField can be. In pandas, it seems like nullability needs to be implemented again in every array class.

closes ENH: Pandas StructDType #40652
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pep8speaks · 2022-01-31T21:57:13Z

Hello @Hoeze! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file pandas/core/arrays/struct.py:

Line 6:89: E501 line too long (110 > 88 characters)
Line 42:1: E302 expected 2 blank lines, found 1
Line 107:89: E501 line too long (97 > 88 characters)
Line 304:89: E501 line too long (95 > 88 characters)
Line 320:89: E501 line too long (108 > 88 characters)
Line 334:89: E501 line too long (102 > 88 characters)
Line 344:89: E501 line too long (102 > 88 characters)
Line 349:89: E501 line too long (94 > 88 characters)
Line 412:89: E501 line too long (98 > 88 characters)
Line 417:89: E501 line too long (91 > 88 characters)
Line 444:89: E501 line too long (98 > 88 characters)
Line 457:89: E501 line too long (101 > 88 characters)
Line 462:89: E501 line too long (96 > 88 characters)
Line 513:17: E126 continuation line over-indented for hanging indent
Line 518:89: E501 line too long (101 > 88 characters)
Line 542:89: E501 line too long (113 > 88 characters)
Line 552:89: E501 line too long (95 > 88 characters)
Line 637:89: E501 line too long (108 > 88 characters)
Line 649:89: E501 line too long (92 > 88 characters)
Line 654:89: E501 line too long (108 > 88 characters)
Line 660:89: E501 line too long (98 > 88 characters)
Line 680:89: E501 line too long (117 > 88 characters)
Line 734:89: E501 line too long (99 > 88 characters)
Line 737:89: E501 line too long (106 > 88 characters)
Line 805:89: E501 line too long (95 > 88 characters)
Line 816:18: E711 comparison to None should be 'if cond is not None:'
Line 918:89: E501 line too long (92 > 88 characters)
Line 946:89: E501 line too long (118 > 88 characters)
Line 996:89: E501 line too long (99 > 88 characters)
Line 997:89: E501 line too long (89 > 88 characters)
Line 1001:89: E501 line too long (101 > 88 characters)
Line 1011:89: E501 line too long (89 > 88 characters)
Line 1030:25: E117 over-indented
Line 1117:1: W391 blank line at end of file

In the file pandas/tests/dtypes/test_structtype.py:

Line 56:89: E501 line too long (104 > 88 characters)

jbrockmendel · 2022-01-31T22:39:32Z

pandas/core/arrays/struct.py

+    ExtensionArray,
+)
+
+# from pandas.core.construction import extract_array


pls remove commented-out code wherever possible. where not possible, pls comment as to why

jbrockmendel · 2022-01-31T22:39:45Z

pandas/core/arrays/struct.py

+
+import logging
+
+log = logging.getLogger(__name__)


is this used?

Hoeze · 2022-02-05

Actually not, I'll remove it.

jbrockmendel · 2022-01-31T22:40:33Z

pandas/core/arrays/struct.py

+    "fields that this StructArray holds"
+    # _field_names: List[str]
+    # "ordered list of field names that this StructArray holds"
+    _mask: Union[NoneType, np.ndarray]


_mask: np.ndarray | None, then can remove NoneType

jbrockmendel · 2022-01-31T22:44:30Z

> How to allocate empty Pandas arrays?

I'm assuming you mean ExtensionArray, since PandasArray is a specific subclass. Use dtype.empty(shape).

> Why does pd.api.types.pandas_dtype(pd.api.types.infer_dtype(v)) not work for e.g. integers?

It is a bit annoying that infer_dtype doesn't return a dtype or dtype-like string. You might be better off with infer_dtype_from(v).

> Why is there no standard way to concatenate Pandas arrays?

Use core.dtypes.concat.concat_compat.

> In pandas, it seems like nullability needs to be implemented in every array class again.

I'm working on making MaskedArray a wrapper that can be used around arbitrary other arrays, but don't hold your breath.
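
The three suggestions in this comment can be tried out roughly as follows. This is a sketch that assumes pandas 1.4 or later (where `ExtensionDtype.empty` is available). `infer_dtype_from` and `concat_compat` live in internal modules under `pandas.core.dtypes`, so their import paths and signatures are not a stable public API and may differ between versions.

```python
import pandas as pd

# Internal helpers named in the review; not part of the public API.
from pandas.core.dtypes.cast import infer_dtype_from
from pandas.core.dtypes.concat import concat_compat

# 1) Allocate an "empty" ExtensionArray of a given shape from its dtype.
#    The contents are placeholders and should be overwritten before use.
dtype = pd.Int64Dtype()
empty = dtype.empty((3,))

# 2) Infer a dtype object (not a descriptive string) from a scalar.
#    The function returns a (dtype, value) pair.
inferred_dtype, value = infer_dtype_from(1)

# 3) Concatenate arrays of the same extension dtype.
left = pd.array([1, 2], dtype="Int64")
right = pd.array([3, None], dtype="Int64")
combined = concat_compat([left, right])
```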

mroeschke · 2022-01-31T22:45:07Z

It might make sense just to have an ArrowStructArray directly and use the pyarrow.StructType as the backing dtype?

jreback · 2022-03-06T23:25:33Z

> It might make sense just to have an ArrowStructArray directly and use the pyarrow.StructType as the backing dtype?

Yeah, I agree. We are moving towards backing with Arrow types rather than having new types 'roll their own'. This is the scalable longer-term solution. There is already an ArrowBackedExtensionArray subclass that you can start from, and @mroeschke is doing some work on this to extend it to numeric operations.

I would be +1 on an implementation (even if limited) done this way.

Hoeze · 2022-04-03T12:35:41Z

Another thing I conclude from pola-rs/polars#3007:
We should not use NamedTuple as the base class, but rather subclass tuple with a custom __getitem__():

tuple[idx] where idx is an integer -> positional indexing of the field
tuple[col] where col is a string -> get field by name
This way, we can keep support for arbitrary column names.
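
A minimal sketch of that idea, assuming a hypothetical class name `StructScalar` and a constructor that takes field names and values separately (neither is specified in this thread). It also shows why `namedtuple` is restrictive: field names must be valid Python identifiers.

```python
from collections import namedtuple


class StructScalar(tuple):
    """Tuple that also supports lookup by field name (hypothetical sketch)."""

    def __new__(cls, field_names, values):
        field_names = tuple(field_names)
        values = tuple(values)
        if len(field_names) != len(values):
            raise ValueError("field_names and values must have the same length")
        self = super().__new__(cls, values)
        self._field_names = field_names
        return self

    def __getitem__(self, key):
        # String keys select a field by name; anything else (int, slice)
        # falls through to ordinary positional tuple indexing.
        if isinstance(key, str):
            try:
                idx = self._field_names.index(key)
            except ValueError:
                raise KeyError(key) from None
            return super().__getitem__(idx)
        return super().__getitem__(key)


row = StructScalar(["0", "my field"], [1.5, "x"])
by_position = row[0]      # positional indexing
by_name = row["0"]        # lookup by a name that is not a valid identifier

# namedtuple rejects the same field name.
try:
    namedtuple("Row", ["0"])
except ValueError as exc:
    namedtuple_error = exc
```

One consequence of this design is that a field named with a string such as "0" and the integer position 0 stay distinct, since the key's type decides which lookup is used.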

I will continue working on this PR as far as my free time allows.

mroeschke · 2022-06-10T21:24:58Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. (PS: I have been working on integrating Arrow into pandas, which can hopefully use Arrow's StructDtype and scalars, but thanks for the effort on this so far!)

initial commit for StructDtype and StructArray

d651395

Hoeze mentioned this pull request Jan 31, 2022

ENH: Pandas StructDType #40652

Closed

Hoeze changed the title ~~initial commit for StructDtype and StructArray~~ Implement nested types: Add StructDtype and StructArray Jan 31, 2022

jbrockmendel reviewed Jan 31, 2022

Hoeze mentioned this pull request Feb 10, 2022

add support for writing parquet files kipoi/kipoi#642

Merged

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Mar 6, 2022

mroeschke closed this Jun 10, 2022
