Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate fixes for Timestamp[MICROS] and infinite loop hang when reading parquet #1460

Closed
wants to merge 3 commits into from

Conversation

anliakho2
Copy link

@anliakho2 anliakho2 commented Mar 17, 2022

Which issue does this PR close?

Closes #1459 , closes #1458 .

Rationale for this change

Fixing few issues experiences reading parquet files

What changes are included in this PR?

Add handling for the Timestamp(TimeUnit::MICROS) and Timestamp(TimeUnit::MILLIS) logical types.
Also handle RLE bit packed runs that don't have proper full packing.

Are there any user-facing changes?

no, except that more files would be read correctly.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 17, 2022
@alamb alamb requested a review from sunchao March 17, 2022 07:53
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @anliakho2

Can you please add some regression tests / update existing ones to cover the missed cases? I would like to ensure we don't accidentally break these features in the future

Comment on lines 433 to 446
ArrowType::Timestamp(ArrowTimeUnit::Nanosecond, ref tz) => {
if let Some(LogicalType::TIMESTAMP(t)) = self.column_desc.logical_type() {
match t.unit {
TimeUnit::MICROS(_) => {
let a = arrow::compute::cast(&array, &ArrowType::Timestamp(Microsecond, tz.clone()))?;
arrow::compute::cast(&a, &ArrowType::Timestamp(Nanosecond, tz.clone()))?
}
TimeUnit::MILLIS(_) => {
let a = arrow::compute::cast(&array, &ArrowType::Timestamp(Millisecond, tz.clone()))?;
arrow::compute::cast(&a, &ArrowType::Timestamp(Nanosecond, tz.clone()))?
}
_ => arrow::compute::cast(&array, &target_type.clone())?
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic looks correct. But we need some tests to verify them.

@anliakho2
Copy link
Author

@viirya @alamb Will do tests shortly, need some time to finish up tasks at work :)

@alamb
Copy link
Contributor

alamb commented Apr 8, 2022

Hi @anliakho2 -- just checking in on this one. I plan to cut a new arrow release next week (Thursday or Friday) and it would be great to include this issue as well.

@tustvold
Copy link
Contributor

I would be happy to add some tests / fix the merge conflicts if you would like a hand getting this over the line. As @alamb says, it would be awesome to include this in the next release 😄

@anliakho2
Copy link
Author

@alamb Sorry was travelling and just came back... I think I could get back to it in time. @tustvold I would also appreciate your help on this, so feel free

@codecov-commenter
Copy link

Codecov Report

Merging #1460 (cf24786) into master (c9549bb) will decrease coverage by 0.03%.
The diff coverage is 18.51%.

@@            Coverage Diff             @@
##           master    #1460      +/-   ##
==========================================
- Coverage   82.83%   82.79%   -0.04%     
==========================================
  Files         190      190              
  Lines       54957    54984      +27     
==========================================
+ Hits        45521    45526       +5     
- Misses       9436     9458      +22     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 85.01% <16.66%> (-2.06%) ⬇️
parquet/src/encodings/rle.rs 92.19% <33.33%> (-0.44%) ⬇️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9549bb...cf24786. Read the comment docs.

@tustvold
Copy link
Contributor

Following investigation in #1459 I think we want to handle this differently, I will close this, cherry-pick the fix for #1458 into a separate PR and get another PR up to alter how we infer the schema of a parquet file to be compatible with what I think is a bug in pyarrow (https://issues.apache.org/jira/browse/ARROW-16184)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
7 participants