Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available #8354

Conversation

carols10cents
Copy link
Contributor

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I think this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the #[ignore] attributes from the LargeArray and LargeUtf8 tests, but they're still failing because the schemas don't match yet-- it looks like this code will need to be changed as well.

That build_array_reader function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

@github-actions
Copy link

github-actions bot commented Oct 6, 2020

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@carols10cents carols10cents changed the title [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available Oct 6, 2020
@github-actions
Copy link

github-actions bot commented Oct 6, 2020

@nevi-me nevi-me self-requested a review October 6, 2020 15:57
carols10cents and others added 3 commits October 7, 2020 09:50
Previously, if an Arrow schema was present in the Parquet metadata, that
schema would always be returned when requesting all columns via
`parquet_to_arrow_schema` and would never be returned when requesting a
subset of columns via `parquet_to_arrow_schema_by_columns`.

Now, if a valid Arrow schema is present in the Parquet metadata and a
subset of columns is requested by Parquet column index, the
`parquet_to_arrow_schema_by_columns` function will try to find a column
of the same name in the Arrow schema first, and then fall back to the
Parquet schema for that column if there isn't an Arrow Field for that
column.

This is part of what is needed to be able to restore Arrow types like
LargeUtf8 from Parquet.
@carols10cents carols10cents marked this pull request as ready for review October 7, 2020 18:25
@carols10cents
Copy link
Contributor Author

Ok @nevi-me, I rebased this PR on the branch and I think this is ready for review now. It pushes more type information from the arrow metadata schema down into the reading code... the LargeBinary and LargeUtf8 tests are still failing, but no longer because their schemas don't match ;)

@nevi-me
Copy link
Contributor

nevi-me commented Oct 7, 2020

@carols10cents did you see integer32llc@55a049b from about 10 minutes before you force-pushed?

@carols10cents
Copy link
Contributor Author

@nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit?

@nevi-me
Copy link
Contributor

nevi-me commented Oct 7, 2020

@nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit?

Yes, it addresses something different, as mine only really adds an alternative projection on columns, and preserves the metadata

@carols10cents
Copy link
Contributor Author

@nevi-me I added your commit onto this branch!

nevi-me added a commit that referenced this pull request Oct 7, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me
Copy link
Contributor

nevi-me commented Oct 7, 2020

Merged

@nevi-me nevi-me closed this Oct 7, 2020
nevi-me added a commit that referenced this pull request Oct 7, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit that referenced this pull request Oct 12, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit that referenced this pull request Oct 16, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit to nevi-me/arrow that referenced this pull request Oct 17, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of apache#8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes apache#8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit that referenced this pull request Oct 25, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit that referenced this pull request Oct 27, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
nevi-me added a commit that referenced this pull request Oct 28, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…m Parquet metadata when available

@nevi-me This is one commit on top of apache#8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes apache#8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants