Provide Arrow Schema Hint to Parquet Reader - Alternative 2 #5939
Conversation
Thanks @efredine -- I started the CI on this PR
Thank you for this contribution @efredine -- I think this PR and API looks good to me 👍
Another test that might be good would be to write a parquet file using the arrow writer with one schema (perhaps a `StringArray`) and then read the data back using another schema (perhaps a `DictionaryArray`).
Adds a more detailed error message for incompatible columns. Adds nested fields to test_with_schema. Adds test for incompatible nested field. Updates documentation.
Force-pushed from d310f92 to 3b4b2a1
I expanded the tests to include converting from a `String` to a `Dictionary` and also included a nested field. I also applied what I learnt about constructing record batches from an iterator ;-).
Thank you @efredine -- I think this PR looks great.
I'll leave it open for another day or two to allow time for others to comment
Thank you so much
/// writer.close().unwrap();
///
/// // Read the file back.
/// // Supply a schema that interprets the Int32 column as a Timestamp.
👍
@@ -2620,6 +2732,275 @@ mod tests {
    assert_eq!(reader.schema(), schema_without_metadata);
}

fn write_parquet_from_iter<I, F>(value: I) -> File
This is a very impressive set of negative tests 💯 Thank you
This looks good to me, even better that you found a way to make the code change relatively small 👍
Which issue does this PR close?
Closes #5657. Alternative to #5671.
Rationale for this change
The parquet reader automatically uses an embedded arrow schema, when present, to hint type inference during decoding. In particular, if the hinted type is compatible with the underlying parquet type, it performs a cast. However, when the writer was not an arrow writer, no embedded schema is available and the arrow types are inferred from the parquet schema alone. This is not always desirable.
For example, you may want to cast an INT64 column to a Timestamp column, or a Timestamp column to a different timezone. This PR allows a schema to be provided for use with type hinting.
What changes are included in this PR?
This is a draft PR for discussion. There are some API alternatives that can be considered. These are described in the section on user facing changes.
Currently, this PR adds a `supplied_schema` field to `ArrowReaderOptions` and modifies the `try_new` method of `ArrowReaderMetadata` to use the `supplied_schema` instead of the metadata when one is provided. This option defaults to `None`, so it should be backwards compatible. The option is provided with a new `with_schema` method on `ArrowReaderOptions`.

It adds tests showing a simple success case. Once the approach is finalized, these can be expanded to cover all the valid cast scenarios using an approach similar to `run_single_column_read_tests`. It also adds tests showing two failure scenarios: the `supplied_schema` must have exactly the same number of columns as the parquet schema, or an error is thrown.

Are there any user-facing changes?
A `with_schema` method is added to `ArrowReaderOptions`. It is intended to be used as follows:

It needs to be provided as an option before the builder is constructed, so it can be utilized to provide type hints when the metadata is being read.
There are different approaches that could be taken with the supplied_schema: