Document Schema metadata expectations #12736

Open
Tracked by #12733
alamb opened this issue Oct 3, 2024 · 6 comments

@alamb
Contributor

alamb commented Oct 3, 2024

Is your feature request related to a problem or challenge?

There is an (implicit) assumption that metadata attached to Schema is preserved during certain operations in DataFusion.

However, this expectation is clearly not well tested or documented (e.g. see #12733)

Describe the solution you'd like

I would like the assumptions documented

Describe alternatives you've considered

I suggest adding documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html that explains the high-level assumptions

Then add a note / link to that section from the optimizer rule traits:
https://docs.rs/datafusion/latest/datafusion/optimizer/trait.AnalyzerRule.html
https://docs.rs/datafusion/latest/datafusion/optimizer/trait.OptimizerRule.html
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/trait.PhysicalOptimizerRule.html

My understanding of the high-level assumptions is:

  • schema level metadata: always passed through
  • field level metadata: when there is a clear 1-1 correspondence from an input column with metadata to an output column, the metadata should be preserved

Examples

  • PROJECT(a, b+c) --> field metadata on a should be preserved, no field metadata on b+c
  • SUM(a) .. GROUP BY b --> field metadata on b is preserved, not on a
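
To make the projection example concrete, here is a rough sketch of the two rules in plain Rust. It is a toy model, not DataFusion or Arrow code: `Field`, `Schema`, `Expr`, `project` and `field_with_meta` are hypothetical stand-ins invented for illustration, and the behavior shown is only my understanding of the assumptions above.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an Arrow field (hypothetical, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub metadata: BTreeMap<String, String>,
}

/// Toy stand-in for an Arrow schema (hypothetical, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
    pub metadata: BTreeMap<String, String>,
}

/// Toy expression: a bare column reference or the sum of two expressions.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Name used for the output field, e.g. `b + c`.
    pub fn name(&self) -> String {
        match self {
            Expr::Column(c) => c.clone(),
            Expr::Add(l, r) => format!("{} + {}", l.name(), r.name()),
        }
    }
}

/// Output schema of a projection under the assumed metadata rules.
pub fn project(input: &Schema, exprs: &[Expr]) -> Schema {
    let fields = exprs
        .iter()
        .map(|expr| {
            let metadata = match expr {
                // Clear 1-1 correspondence with an input column: keep its metadata.
                Expr::Column(name) => input
                    .fields
                    .iter()
                    .find(|f| &f.name == name)
                    .map(|f| f.metadata.clone())
                    .unwrap_or_default(),
                // Computed expression: no single source column, so no metadata.
                Expr::Add(_, _) => BTreeMap::new(),
            };
            Field { name: expr.name(), metadata }
        })
        .collect();
    // Schema-level metadata is always passed through.
    Schema { fields, metadata: input.metadata.clone() }
}

/// Helper: build a field carrying a single `key = value` metadata entry.
pub fn field_with_meta(name: &str, key: &str, value: &str) -> Field {
    let mut metadata = BTreeMap::new();
    metadata.insert(key.to_string(), value.to_string());
    Field { name: name.to_string(), metadata }
}
```

Calling `project` with the expressions `a` and `b + c` on an input whose fields all carry metadata yields an output where the first field keeps the metadata of `a`, the second field has none, and the schema-level metadata is unchanged. The aggregate example would follow the same pattern, with the group-by column playing the role of the bare column reference.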

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 3, 2024
@alamb
Contributor Author

alamb commented Oct 3, 2024

I believe @wiedld plans to work on this

@wiedld
Contributor

wiedld commented Oct 3, 2024

take

@alamb
Contributor Author

alamb commented Oct 21, 2024

I will take a shot at documenting this

@alamb
Contributor Author

alamb commented Nov 13, 2024

#13305 (comment) has some additional context

@westonpace
Member

I'm not sure if this is related or not but I encountered an error during optimization:

Error: join_selection caused by Internal error: PhysicalOptimizer rule 'join_selection' failed, due to generate a different schema, original schema: Schema { fields: [Field { name: "a", ...  }], metadata: {metadata_from_table_one}, new schema: Schema { fields: [Field { name: "a", ... }], metadata: {metadata_from_table_two}. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I think the join operator is returning the schema metadata from either its first or its second input, so an optimization that swaps the join order changes the join operator's output schema, and this causes the optimizer to bail.
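
A minimal sketch of that suspected failure mode, in plain Rust rather than DataFusion code. Everything here is hypothetical: `Schema`, `join_schema`, `reorder`, `check_schema_unchanged` and `schema` are invented for illustration, and the assumption that the join copies schema metadata from its first input only is exactly the guess made above, not confirmed behavior.

```rust
use std::collections::BTreeMap;

/// Toy schema: field names plus schema-level metadata (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
    pub metadata: BTreeMap<String, String>,
}

/// ASSUMPTION: the join concatenates the fields of both inputs but takes
/// schema-level metadata from its first input only.
pub fn join_schema(first: &Schema, second: &Schema) -> Schema {
    let mut fields = first.fields.clone();
    fields.extend(second.fields.iter().cloned());
    Schema { fields, metadata: first.metadata.clone() }
}

/// Restore a given column order, as a projection placed on top of a swapped
/// join might. Schema-level metadata is left untouched.
pub fn reorder(schema: &Schema, order: &[String]) -> Schema {
    let fields = order
        .iter()
        .filter(|name| schema.fields.contains(*name))
        .cloned()
        .collect();
    Schema { fields, metadata: schema.metadata.clone() }
}

/// Stand-in for a check that an optimizer rule left the schema unchanged.
pub fn check_schema_unchanged(original: &Schema, rewritten: &Schema) -> Result<(), String> {
    if original == rewritten {
        Ok(())
    } else {
        Err(format!(
            "rule produced a different schema: original {:?}, new {:?}",
            original, rewritten
        ))
    }
}

/// Helper: build a schema whose metadata records which table it came from.
pub fn schema(fields: &[&str], source: &str) -> Schema {
    let mut metadata = BTreeMap::new();
    metadata.insert("source".to_string(), source.to_string());
    Schema {
        fields: fields.iter().map(|f| f.to_string()).collect(),
        metadata,
    }
}
```

With two inputs built by `schema(&["a"], "table_one")` and `schema(&["b"], "table_two")`, joining them in one order and then in the swapped order (followed by `reorder` to restore the column order) gives identical fields but different schema-level metadata, so `check_schema_unchanged` returns an error. That matches the shape of the error message above, where the fields are the same and only the metadata differs.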

@alamb
Contributor Author

alamb commented Jan 24, 2025

That definitely sounds like a bug
