[EPIC] Improved support for nested / structured types (Struct , List, ListArray, and other Composite types) #2326

Open
15 of 27 tasks
alamb opened this issue Apr 24, 2022 · 17 comments
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Apr 24, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket is designed to capture the work needed to properly support Arrow Struct types in DataFusion.

https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html says that nested types are not supported. They are not fully supported, but parts of the support are already present, such as a way to serialize them via ArrowWriter and the field["nested_field"] syntax.

Describe the solution you'd like
Research and describe / implement what else remains for proper support.

Array (ListArray) support:

Map (MapArray) support:

Struct (StructArray) support:

Union (UnionArray) support

Other

Known issues so far:

@nl5887
Contributor

nl5887 commented Jun 9, 2022

This https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/file_format/mod.rs#L238 is one cause of the errors related to column projection. It compares the complete enum, so it fails when the field order differs.

Arrow has a method to compare data types (https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L674). I think this method should be made public and used in the comparison above.

Currently datafusion uses match_field_names (default true), https://github.com/apache/arrow-rs/blob/master/arrow/src/record_batch.rs#L153 causing the error.
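
To illustrate the kind of comparison suggested above, here is a minimal sketch in plain Rust. It is a toy model under stated assumptions: the `DataType` and `Field` types, the `field` helper, and `equals_ignoring_field_order` are hypothetical stand-ins written for this example, not the arrow-rs types or the method linked above. It also assumes field names are unique within a struct.

```rust
// Toy model only: hypothetical stand-ins, NOT arrow-rs's DataType / Field.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Float64,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// Hypothetical helper to build a nullable field.
fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type, nullable: true }
}

/// Compare two types, matching struct fields by name rather than by position.
/// Assumes field names are unique within each struct.
fn equals_ignoring_field_order(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::Struct(fa), DataType::Struct(fb)) => {
            fa.len() == fb.len()
                && fa.iter().all(|x| {
                    fb.iter().any(|y| {
                        x.name == y.name
                            && x.nullable == y.nullable
                            && equals_ignoring_field_order(&x.data_type, &y.data_type)
                    })
                })
        }
        // Non-struct types fall back to the derived, exact comparison.
        _ => a == b,
    }
}
```

With this model, two structs that list the same fields in a different order are unequal under the derived `==` (which is the failure mode described above) but equal under `equals_ignoring_field_order`.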

@alamb
Contributor Author

alamb commented Jun 10, 2022

Thanks for the investigation @nl5887 -- that definitely sounds plausible. Feel free to file a PR with the proposed changes -- we would love to review it

@nl5887
Contributor

nl5887 commented Jun 26, 2022

This one is also related: #2581

@tv42
Contributor

tv42 commented Feb 22, 2023

Reminder to write docs: #1222

@alamb alamb changed the title Properly support arrow Struct types / Composite type in DataFusion [EPIC] Properly support arrow Struct types / Composite type in DataFusion Apr 24, 2023
@alamb alamb changed the title [EPIC] Properly support arrow Struct types / Composite type in DataFusion [EPIC] Improved support for nested / structured types (Struct , List, ListArray, and other Composite types) May 25, 2023
@alexwilcoxson-rel

Potential to add to list #7012

@alamb
Contributor Author

alamb commented Mar 27, 2024

We are starting to make progress on struct support --

There is a PR up to support named_struct #9743 and work afoot to support nicer literal syntax: #9820 🚀

@toaiduongdh

Hi, I think unnest support for struct could be an item in this epic, right?

@alamb
Contributor Author

alamb commented Apr 27, 2024

Hi, I think unnest support for struct could be an item in this epic, right?

That would make sense to me -- is there a ticket that describes what this means?

@duongcongtoai
Contributor

I created a ticket: #10264

@alamb
Contributor Author

alamb commented Apr 30, 2024

I created a ticket: #10264

Thank you. I added this to the list in the ticket description

@duongcongtoai
Contributor

duongcongtoai commented May 25, 2024

I added an issue to support recursive unnest: #10660. I think it should belong to this epic

@alamb
Contributor Author

alamb commented May 25, 2024

I added an issue to support recursive unnest: #10660. I think it should belong to this epic

Added

@goldmedal
Contributor

I added an issue to check for duplicate or null field names in structs: #11438

@Throne3d

I think #11445 is related to this epic

@alamb
Contributor Author

alamb commented Jul 14, 2024

I think #11445 is related to this epic

Thank you -- added

@TheBuilderJR

Right now datafusion doesn't support struct evolution very well. Imagine you have a struct named customData with a field someOptionEnabled in one parquet file; later down the line you add a new field newAddedOption to the customData struct in another parquet file. Currently, when you try to SELECT * FROM table, you'll get this error:

{"message":"Failed to collect DataFrame batches: Plan(\"Cannot cast file schema field customData of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) to table schema field of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \\\"newAddedOption\\\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])\")","status":"error"}

Feels like we should handle this more gracefully. cc @alamb

I'm happy to make contributions if someone can point me to the right places to look.
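
One possible reading of "more gracefully" is to fill fields that are missing from an older file with nulls, as long as the table schema marks them nullable. The sketch below models that idea in plain Rust. It is only an assumption about the desired behaviour, not how DataFusion works or plans to work: `Value`, `TableField`, and `adapt_struct` are hypothetical names invented for this example.

```rust
use std::collections::HashMap;

// Toy model only: hypothetical types, NOT DataFusion or arrow-rs APIs.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Boolean(bool),
    Float64(f64),
}

struct TableField {
    name: &'static str,
    nullable: bool,
}

/// Adapt one struct value read from a file to the table's struct schema.
/// Fields absent from the file become None when the table field is nullable,
/// and produce an error otherwise.
fn adapt_struct(
    file_value: &HashMap<String, Value>,
    table_fields: &[TableField],
) -> Result<Vec<(String, Option<Value>)>, String> {
    table_fields
        .iter()
        .map(|f| match file_value.get(f.name) {
            Some(v) => Ok((f.name.to_string(), Some(v.clone()))),
            None if f.nullable => Ok((f.name.to_string(), None)),
            None => Err(format!("missing non-nullable field {}", f.name)),
        })
        .collect()
}
```

In this model, a customData value that only contains someOptionEnabled adapts to a table schema that also has a nullable newAddedOption, with newAddedOption set to None, instead of failing the whole query.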

@alamb
Contributor Author

alamb commented Aug 15, 2024

Right now datafusion doesn't support struct evolution very well. Imagine you have a struct named customData with a field someOptionEnabled in one parquet file; later down the line you add a new field newAddedOption to the customData struct in another parquet file. Currently, when you try to SELECT * FROM table, you'll get this error:

{"message":"Failed to collect DataFrame batches: Plan(\"Cannot cast file schema field customData of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) to table schema field of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \\\"newAddedOption\\\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])\")","status":"error"}

Feels like we should handle this more gracefully. cc @alamb

I agree

I'm happy to make contributions if someone can point me to the right places to look.

My suggestion is to start by filing a ticket with a self-contained reproducer (either Rust code or SQL) that shows what you are trying to do.

This would likely become part of the tests for any code improvement we make, and it would give other contributors more detail to help point to the right place in the code

9 participants