-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available #8354
ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available #8354
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
Previously, if an Arrow schema was present in the Parquet metadata, that schema would always be returned when requesting all columns via `parquet_to_arrow_schema` and would never be returned when requesting a subset of columns via `parquet_to_arrow_schema_by_columns`. Now, if a valid Arrow schema is present in the Parquet metadata and a subset of columns is requested by Parquet column index, the `parquet_to_arrow_schema_by_columns` function will try to find a column of the same name in the Arrow schema first, and then fall back to the Parquet schema for that column if there isn't an Arrow Field for that column. This is part of what is needed to be able to restore Arrow types like LargeUtf8 from Parquet.
…es back from Parquet
f65b2ba
to
30e3e41
Compare
Ok @nevi-me, I rebased this PR on the branch and I think this is ready for review now. It pushes more type information from the arrow metadata schema down into the reading code... the LargeBinary and LargeUtf8 tests are still failing, but no longer because their schemas don't match ;) |
@carols10cents did you see integer32llc@55a049b from about 10 minutes before you force-pushed? |
@nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit? |
Yes, it addresses something different, as mine only really adds an alternative projection on columns, and preserves the metadata |
@nevi-me I added your commit onto this branch! |
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Merged |
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of apache#8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes apache#8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…m Parquet metadata when available @nevi-me This is one commit on top of apache#8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes apache#8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I think this will bring the Rust implementation more in line with C++, but I'm not certain.
I tried removing the
#[ignore]
attributes from theLargeArray
andLargeUtf8
tests, but they're still failing because the schemas don't match yet-- it looks like this code will need to be changed as well.That
build_array_reader
function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?