ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available #8354

carols10cents · 2020-10-06T02:02:56Z

@nevi-me This is one commit on top of #8330 that I'm opening to get your feedback on whether it will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet. It looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here. Could the code be shared, or is there a reason they're separate?

github-actions · 2020-10-06T02:06:54Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2020-10-06T12:08:21Z

https://issues.apache.org/jira/browse/ARROW-10168

Previously, if an Arrow schema was present in the Parquet metadata, that schema would always be returned when requesting all columns via `parquet_to_arrow_schema` and would never be returned when requesting a subset of columns via `parquet_to_arrow_schema_by_columns`. Now, if a valid Arrow schema is present in the Parquet metadata and a subset of columns is requested by Parquet column index, the `parquet_to_arrow_schema_by_columns` function will try to find a column of the same name in the Arrow schema first, and then fall back to the Parquet schema for that column if there isn't an Arrow Field for that column. This is part of what is needed to be able to restore Arrow types like LargeUtf8 from Parquet.
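The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DataType`, `Field`, and `project_fields` are simplified, hypothetical stand-ins for the real `arrow` and `parquet` crate types, and the sketch assumes the Parquet-derived fields are already available in column order.

```rust
// Hypothetical, simplified stand-ins for the real arrow types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    LargeUtf8,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

impl Field {
    pub fn new(name: &str, data_type: DataType) -> Self {
        Field {
            name: name.to_string(),
            data_type,
        }
    }
}

/// Resolve the fields for a subset of Parquet columns (hypothetical helper).
///
/// `parquet_fields` are the fields derived from the Parquet schema alone,
/// in column order. `arrow_fields` is the Arrow schema embedded in the
/// Parquet metadata, if one was present and valid.
pub fn project_fields(
    parquet_fields: &[Field],
    arrow_fields: Option<&[Field]>,
    column_indices: &[usize],
) -> Vec<Field> {
    column_indices
        .iter()
        // Out-of-range indices are skipped in this sketch.
        .filter_map(|&i| parquet_fields.get(i))
        .map(|pq| {
            arrow_fields
                // Prefer an Arrow field with the same name as the column...
                .and_then(|fields| fields.iter().find(|f| f.name == pq.name))
                .cloned()
                // ...and fall back to the Parquet-derived field otherwise.
                .unwrap_or_else(|| pq.clone())
        })
        .collect()
}
```

With an embedded Arrow schema that declares `text` as `LargeUtf8`, requesting that column by index yields the `LargeUtf8` field, while a column missing from the Arrow schema keeps whatever type the Parquet schema implies.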


carols10cents · 2020-10-07T18:27:26Z

Ok @nevi-me, I rebased this PR on the branch and I think this is ready for review now. It pushes more type information from the Arrow metadata schema down into the reading code. The LargeBinary and LargeUtf8 tests are still failing, but no longer because their schemas don't match ;)

nevi-me · 2020-10-07T18:33:40Z

@carols10cents did you see integer32llc@55a049b from about 10 minutes before you force-pushed?

carols10cents · 2020-10-07T18:35:18Z

@nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit?

nevi-me · 2020-10-07T18:42:06Z

> @nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit?

Yes, it addresses something different: mine only adds an alternative projection on columns and preserves the metadata.

carols10cents · 2020-10-07T18:57:51Z

@nevi-me I added your commit onto this branch!

@nevi-me

…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>

nevi-me · 2020-10-07T22:17:25Z

Merged


jorgecarleitao added Component: Rust Component: Parquet labels Oct 6, 2020

carols10cents changed the title ~~[Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available~~ ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available Oct 6, 2020

nevi-me self-requested a review October 6, 2020 15:57

carols10cents and others added 3 commits October 7, 2020 09:50

run cargo +stable fmt (and clippy)

332f440

ARROW-10168: [Rust] [Parquet] Convert LargeString and LargeBinary types back from Parquet

30e3e41

carols10cents force-pushed the schema-roundtrip branch from f65b2ba to 30e3e41 Compare October 7, 2020 18:25

carols10cents marked this pull request as ready for review October 7, 2020 18:25

add option to project root columns from schema

69b4743

nevi-me approved these changes Oct 7, 2020


nevi-me closed this Oct 7, 2020

tustvold mentioned this pull request May 6, 2022

Parquet Treats Embedded Arrow Schema as Authoritative apache/arrow-rs#1663

Closed

asfimport mentioned this pull request Jan 16, 2021

[Rust] [Parquet] Extend arrow schema conversion to projected fields #26176

Closed
