Integrate fixes for Timestamp[MICROS] and infinite loop hang when reading parquet #1460
Thank you @anliakho2
Can you please add some regression tests / update existing ones to cover the missed cases? I would like to ensure we don't accidentally break these features in the future
parquet/src/arrow/array_reader.rs
Outdated
ArrowType::Timestamp(ArrowTimeUnit::Nanosecond, ref tz) => {
    if let Some(LogicalType::TIMESTAMP(t)) = self.column_desc.logical_type() {
        match t.unit {
            TimeUnit::MICROS(_) => {
                let a = arrow::compute::cast(&array, &ArrowType::Timestamp(Microsecond, tz.clone()))?;
                arrow::compute::cast(&a, &ArrowType::Timestamp(Nanosecond, tz.clone()))?
            }
            TimeUnit::MILLIS(_) => {
                let a = arrow::compute::cast(&array, &ArrowType::Timestamp(Millisecond, tz.clone()))?;
                arrow::compute::cast(&a, &ArrowType::Timestamp(Nanosecond, tz.clone()))?
            }
            _ => arrow::compute::cast(&array, &target_type.clone())?
        }
    }
The logic looks correct, but we need some tests to verify it.
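As a rough illustration of what such a test would need to pin down, the sketch below models the unit scaling on plain `i64` values, without the arrow crate. The `Unit` enum and `to_nanos` function are hypothetical names invented for this example and are not part of arrow-rs. The sketch only assumes that a timestamp stored in milliseconds or microseconds must be multiplied by the corresponding factor to yield nanoseconds, which is what the two-step cast in the diff is meant to achieve.

```rust
/// Hypothetical stand-in for the Parquet timestamp units named in the diff.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    Millis,
    Micros,
    Nanos,
}

/// Scale a raw INT64 timestamp to nanoseconds.
/// Returns `None` if the multiplication would overflow `i64`
/// (a choice made for this sketch only).
fn to_nanos(raw: i64, unit: Unit) -> Option<i64> {
    let factor: i64 = match unit {
        Unit::Millis => 1_000_000,
        Unit::Micros => 1_000,
        Unit::Nanos => 1,
    };
    raw.checked_mul(factor)
}
```

One second written as `1_000` milliseconds, `1_000_000` microseconds, or `1_000_000_000` nanoseconds should map to the same nanosecond value. Reading the raw value without scaling would silently produce a timestamp that is off by a factor of 1,000 or 1,000,000.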
Hi @anliakho2 -- just checking in on this one. I plan to cut a new arrow release next week (Thursday or Friday) and it would be great to include this issue as well.
I would be happy to add some tests / fix the merge conflicts if you would like a hand getting this over the line. As @alamb says, it would be awesome to include this in the next release 😄
Codecov Report
@@            Coverage Diff             @@
##           master    #1460      +/-   ##
==========================================
- Coverage   82.83%   82.79%   -0.04%
==========================================
  Files         190      190
  Lines       54957    54984      +27
==========================================
+ Hits        45521    45526       +5
- Misses       9436     9458      +22
Following the investigation in #1459, I think we want to handle this differently. I will close this, cherry-pick the fix for #1458 into a separate PR, and open another PR to alter how we infer the schema of a parquet file so that it is compatible with what I think is a bug in pyarrow (https://issues.apache.org/jira/browse/ARROW-16184).
Which issue does this PR close?
Closes #1459, closes #1458.
Rationale for this change
Fixes a few issues experienced when reading parquet files.
What changes are included in this PR?
Add handling for the Timestamp(TimeUnit::MICROS) and Timestamp(TimeUnit::MILLIS) logical types.
Also handle RLE bit-packed runs that are not fully packed.
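To make the second point more concrete, here is a minimal sketch of one general way to decode a bit-packed run defensively. It is not the code from this PR. The function name `unpack_lsb_first` and its parameters are assumptions made for illustration. It unpacks values least-significant-bit first, as the Parquet RLE/bit-packing hybrid encoding does, and stops as soon as the buffer cannot supply another complete value, so a truncated run ends the loop instead of stalling it.

```rust
/// Unpack up to `max_values` values of `bit_width` bits each from `data`,
/// reading bits from the least significant bit of each byte upward.
/// Stops early when the buffer cannot supply another complete value.
/// (Hypothetical helper for illustration; not an arrow-rs API.)
fn unpack_lsb_first(data: &[u8], bit_width: usize, max_values: usize) -> Vec<u64> {
    assert!(bit_width >= 1 && bit_width <= 64);
    let total_bits = data.len() * 8;
    let mut out = Vec::new();
    let mut bit_pos = 0usize;
    // Both conditions bound the loop: the caller's limit and the bits available.
    while out.len() < max_values && bit_pos + bit_width <= total_bits {
        let mut value = 0u64;
        for i in 0..bit_width {
            let p = bit_pos + i;
            let bit = (data[p / 8] >> (p % 8)) & 1;
            value |= (bit as u64) << i;
        }
        out.push(value);
        bit_pos += bit_width;
    }
    out
}
```

With the bytes `0x88, 0xC6, 0xFA` and a bit width of 3, the sketch yields the values 0 through 7. If the last byte is dropped, it returns the five values that are fully present and terminates, leaving it to the caller to decide whether a short run is an error.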
Are there any user-facing changes?
No, except that more files will be read correctly.