
Add SchemaAdapterExec #2292

Closed
tustvold opened this issue Apr 20, 2022 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@tustvold
Contributor

tustvold commented Apr 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Part of #2079, related to #2170

Currently schema adaptation is handled within each of the file format specific operators. As described in #2079 this has a number of drawbacks.

Describe the solution you'd like

I would like a SchemaAdapterExec that can be created with a provided Schema and a child ExecutionPlan. It would then adapt the schema of the batches returned by this inner ExecutionPlan to match the provided Schema, creating null columns as necessary.

This can likely reuse the existing SchemaAdapter.
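For illustration, the adaptation step such an operator would perform might look roughly like the following. This is a simplified, stdlib-only sketch: plain vectors of optional values stand in for Arrow `RecordBatch`es, and the name `adapt_batch` is hypothetical, not part of DataFusion's API.

```rust
// Simplified stand-ins for Arrow's Schema and RecordBatch.
type Schema = Vec<String>;          // ordered column names
type Column = Vec<Option<i64>>;     // a column of nullable values
type Batch = (Schema, Vec<Column>); // (schema, columns)

/// Adapt `batch` to the `target` schema: reorder its columns to match
/// the target order, and fill columns missing from the batch with
/// all-null columns of the right length.
fn adapt_batch(target: &Schema, batch: &Batch) -> Batch {
    let (batch_schema, columns) = batch;
    let num_rows = columns.first().map_or(0, |c| c.len());
    let adapted = target
        .iter()
        .map(|name| match batch_schema.iter().position(|n| n == name) {
            Some(i) => columns[i].clone(),  // column exists in the batch: keep it
            None => vec![None; num_rows],   // column missing: synthesize nulls
        })
        .collect();
    (target.clone(), adapted)
}

fn main() {
    // A batch that only has column "a", adapted to a target of ["a", "b"].
    let target = vec!["a".to_string(), "b".to_string()];
    let batch = (vec!["a".to_string()], vec![vec![Some(1), Some(2)]]);
    let (schema, cols) = adapt_batch(&target, &batch);
    assert_eq!(schema, target);
    assert_eq!(cols[0], vec![Some(1), Some(2)]);
    assert_eq!(cols[1], vec![None, None]); // "b" padded with nulls
}
```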

FYI @matthewmturner @thinkharderdev

@tustvold added the enhancement and help wanted labels Apr 20, 2022
@alamb
Contributor

alamb commented Apr 20, 2022

@thinkharderdev
Contributor

I have some concerns about this. The problem is that this sort of assumes we actually know at planning time what the schema for each individual file is in a ListingScan. If you infer the schemas at planning time and merge them together to get the table schema, then that is true. But since this happens during planning and can be quite expensive, I suspect that real-world use cases will leverage some sort of metadata catalog to get the merged schema for a logical table instead of re-deriving it for each query. In that case we have no idea what the individual file schemas are.

@tustvold
Contributor Author

Thank you for bringing this up. To phrase it differently, to check my understanding:

  • The ParquetExec, etc... needs to provide a schema at plan time
  • It needs to yield batches that match this schema
  • Depending on the catalog, the individual files might not match this schema, but must be compatible with it

I agree there doesn't appear to be a way to avoid the file operator handling the schema adaptation itself. I will close this and update the other tickets accordingly. Thank you 👍
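Concretely, a file operator that owns the adaptation might, when opening each file, build a mapping from the plan-time table schema to that file's physical schema, erroring only when the schemas are genuinely incompatible. This is a stdlib-only illustration; `map_file_to_table` and its string-based type check are hypothetical simplifications, not DataFusion code.

```rust
// A schema modeled as an ordered list of (column name, type name) pairs.
type Field = (String, String);

/// For each table column, record where it lives in the file
/// (or None if the file lacks it and it must be null-padded at read
/// time). Returns an error if a shared column has an incompatible type.
fn map_file_to_table(table: &[Field], file: &[Field]) -> Result<Vec<Option<usize>>, String> {
    table
        .iter()
        .map(|(name, ty)| match file.iter().position(|(n, _)| n == name) {
            Some(i) if &file[i].1 == ty => Ok(Some(i)),
            Some(i) => Err(format!(
                "column {name}: file type {} is incompatible with table type {ty}",
                file[i].1
            )),
            None => Ok(None), // file lacks the column: pad with nulls when reading
        })
        .collect()
}

fn main() {
    let table = vec![
        ("a".to_string(), "Int64".to_string()),
        ("b".to_string(), "Utf8".to_string()),
    ];
    // This file only contains column "a"; "b" will be null-padded.
    let file = vec![("a".to_string(), "Int64".to_string())];
    let mapping = map_file_to_table(&table, &file).unwrap();
    assert_eq!(mapping, vec![Some(0), None]);
}
```

The key point matching the bullets above: the mapping is derived per file at read time, so only the merged table schema needs to be known when the plan is built.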

@thinkharderdev
Contributor

  • The ParquetExec, etc... needs to provide a schema at plan time
  • It needs to yield batches that match this schema
  • Depending on the catalog, the individual files might not match this schema, but must be compatible with it

Yeah, exactly
