support full u32 and u64 roundtrip through parquet #258
Codecov Report
@@ Coverage Diff @@
## master #258 +/- ##
==========================================
+ Coverage 82.53% 82.54% +0.01%
==========================================
Files 162 162
Lines 43796 43862 +66
==========================================
+ Hits 36149 36208 +59
- Misses 7647 7654 +7
parquet/src/arrow/array_reader.rs
@@ -380,6 +380,18 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
}
Arc::new(builder.finish()) as ArrayRef
}
ArrowType::UInt64 => match array.data_type() {
This code was confusing to me, as it manipulates the Arrow array after it has been created rather than manipulating the data before creating the array.
It also seems like it will result in a copy of all Int64 columns, which may not be ideal.
I wonder if you considered creating the `UInt64Array` directly from the parquet data up here: https://github.com/apache/arrow-rs/pull/258/files#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fR316-R319 instead?
That may be clearer to understand as well as more performant.
As far as I can see we always cast the array, see default case in line 383 (pre-patch) / 395 (post-patch). So we would always copy, no?
I was thinking of the additional copy that comes from the `arity` call, specifically https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/arity.rs#L64
The `arity` call will be in addition to any copying that `cast` does, I think.
Also, when the types are the same, `cast` is a no-op.
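The difference between the two approaches discussed in this thread can be sketched without any Arrow types. This is only an illustration under stated assumptions: plain `Vec`s stand in for Arrow buffers, and the function names `decode_u64_direct` and `decode_then_map` are hypothetical, not part of arrow-rs or parquet.

```rust
/// Hypothetical stand-in for building the unsigned array directly from the
/// decoded Parquet values: one pass, one output allocation.
fn decode_u64_direct(raw: &[i64]) -> Vec<u64> {
    // `as` between same-width integers keeps the bit pattern unchanged.
    raw.iter().map(|&v| v as u64).collect()
}

/// Hypothetical stand-in for the approach questioned in review: first
/// materialize a signed "array", then map it into a second, unsigned one.
fn decode_then_map(raw: &[i64]) -> Vec<u64> {
    let signed: Vec<i64> = raw.to_vec(); // first copy: signed array is built
    signed.iter().map(|&v| v as u64).collect() // second copy: unary map
}
```

Both functions return the same values; for the input `[0, 1, -1]` each yields `[0, 1, u64::MAX]`. The second simply allocates and fills an intermediate buffer that the first avoids.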
done.
The test failure
I think it is looking great. Thanks @crepererum !
@crepererum thanks for picking this up, and working on it. Do we have the same risk for
550a48f to 0a83a42
@nevi-me good catch. Indeed there was the same bug. Added a test and fixed as well.
Will rebase once #267 is merged.
#267 has been merged
Seems logical since all other kernels are re-exported as well under this flat hierarchy.
- updates arrow to parquet type mapping to use reinterpret/overflow cast for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should be compared unsigned (as per spec)
Fixes apache#254.
This is identical to the solution we now have for u64.
0a83a42 to 2710f31
ready :)
/// Evaluate `a > b` according to underlying logical type.
fn compare_greater(&self, a: &T::T, b: &T::T) -> bool {
if let Some(LogicalType::INTEGER(int_type)) = self.descr.logical_type() {
Note: not all implementations might write `LogicalType`, even though `ConvertedType` is deprecated. I had to think a bit about this, but it's fine because this is on the write side, where we will always write `LogicalType` going forward.
If this were on the read side, we'd also have to check `ConvertedType` as a fallback.
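To make the effect of the unsigned comparison concrete, here is a minimal sketch of how min/max statistics change when the stored `i64` bit patterns are compared as `u64`. The `IntAnnotation` enum and the free functions are hypothetical simplifications for illustration; they are not the parquet crate's `LogicalType`, `ConvertedType`, or column writer API.

```rust
/// Hypothetical, simplified stand-in for a column's integer annotation.
#[derive(Clone, Copy)]
enum IntAnnotation {
    Signed,
    Unsigned,
    /// No annotation present: fall back to the physical (signed) order.
    Missing,
}

/// Evaluate `a > b` for physical INT64 values according to the annotation.
fn compare_greater(annotation: IntAnnotation, a: i64, b: i64) -> bool {
    match annotation {
        // Reinterpret the bits as unsigned before comparing.
        IntAnnotation::Unsigned => (a as u64) > (b as u64),
        IntAnnotation::Signed | IntAnnotation::Missing => a > b,
    }
}

/// Compute (min, max) over stored values using `compare_greater`.
fn min_max(annotation: IntAnnotation, values: &[i64]) -> Option<(i64, i64)> {
    let mut iter = values.iter().copied();
    let first = iter.next()?;
    let (mut min, mut max) = (first, first);
    for v in iter {
        if compare_greater(annotation, min, v) {
            min = v;
        }
        if compare_greater(annotation, v, max) {
            max = v;
        }
    }
    Some((min, max))
}
```

With the stored values `[1, -1, 5]`, where `-1` is the bit pattern of `u64::MAX`, the signed order gives min `-1` and max `5`, while the unsigned order gives min `1` and max `-1` (that is, `u64::MAX`). Statistics computed with the wrong order would therefore describe the wrong range for an unsigned column.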
LGTM
Which issue does this PR close?
Closes #254 .
Rationale for this change
Up until now u64 values larger than `i64::MAX` were silently truncated to "invalid" when going through parquet. With this change we follow the C++ stack and
What changes are included in this PR?
`is_signed` flag in statistics calculation (as per spec)
Are there any user-facing changes?
Yes, users can now expect to have full u64 storage support. Old files should still work since we previously marked values larger than `i64::MAX` as invalid.
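The roundtrip property described here can be sketched with plain integer casts. The helper names below are hypothetical and only illustrate the bit-reinterpreting mapping; `u64_to_parquet_checked` is a rough analogue of the earlier behaviour, in which out-of-range values had no valid representation, and is not the previous implementation.

```rust
/// Hypothetical helper: store a u64 in Parquet's signed INT64 physical type
/// by reinterpreting the bits (values above i64::MAX become negative).
fn u64_to_parquet(v: u64) -> i64 {
    v as i64
}

/// Hypothetical helper: recover the original u64 from the stored bits.
fn u64_from_parquet(v: i64) -> u64 {
    v as u64
}

/// The same idea for u32 stored in the signed INT32 physical type.
fn u32_to_parquet(v: u32) -> i32 {
    v as i32
}

fn u32_from_parquet(v: i32) -> u32 {
    v as u32
}

/// Rough analogue of the earlier behaviour: a value-preserving conversion
/// that has no result for anything above i64::MAX.
fn u64_to_parquet_checked(v: u64) -> Option<i64> {
    if v <= i64::MAX as u64 {
        Some(v as i64)
    } else {
        None
    }
}
```

For example, `u64_to_parquet(u64::MAX)` is `-1`, and `u64_from_parquet(-1)` is `u64::MAX` again, so the full unsigned range survives the roundtrip, whereas `u64_to_parquet_checked(u64::MAX)` is `None`.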