Add DELTA_BINARY_PACKED encoder For Int32 and Int64 #382
@aloneguid please take a look when you can, thanks
Will do. I'll try to include this in the upcoming release.
/// The dictionary maps column names (as strings) to their corresponding encodings.
/// Each column can be encoded using a specific encoding method defined by the 'Encoding' enum.
/// </remarks>
public Dictionary<string, Encoding> ColumnEncoding { get; set; } = new Dictionary<string, Encoding>();
How do we use this?
Are only primitive column names at root supported as a key?
Are all Encodings respected?
How do we use this?
Here is the sample.
...
var dataField = new DataField<long>("column1");
ParquetOptions options = new ParquetOptions() {
    ColumnEncoding = new Dictionary<string, Meta.Encoding>() {
        { "column1", Meta.Encoding.DELTA_BINARY_PACKED }
    }
};
...
Are all Encodings respected?
Currently only DELTA_BINARY_PACKED is supported for column encoding.
Here is the validation function
private static bool EnsureColumnEncodingIsSupported(Encoding columnEncoding) {
    switch(columnEncoding) {
        case Encoding.DELTA_BINARY_PACKED:
            return true;
        default:
            throw new ArgumentException($"Not supported column encoding {columnEncoding}");
    }
}
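As a rough illustration of how the same check could be applied to a whole options dictionary before any data is written, here is a minimal self-contained sketch. The `SketchEncoding` enum and the `EnsureAllSupported` helper are hypothetical stand-ins made up for this example; they are not part of the library or of this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's encoding enum (only a few members shown).
enum SketchEncoding { PLAIN, RLE, DELTA_BINARY_PACKED }

static class ColumnEncodingValidationSketch {
    // Walks every (column name, encoding) pair and throws on the first unsupported encoding,
    // so a bad configuration fails up front rather than in the middle of a write.
    public static void EnsureAllSupported(IReadOnlyDictionary<string, SketchEncoding> columnEncoding) {
        foreach (KeyValuePair<string, SketchEncoding> entry in columnEncoding) {
            if (entry.Value != SketchEncoding.DELTA_BINARY_PACKED) {
                throw new ArgumentException(
                    $"Not supported column encoding {entry.Value} for column '{entry.Key}'");
            }
        }
    }
}
```

Including the column name in the message is a design choice of this sketch, intended to make a misconfigured entry easier to locate.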
@@ -9,24 +12,120 @@ namespace Parquet.Encodings {
 ///
 /// Supported Types: INT32, INT64
 /// </summary>
-static partial class DeltaBinaryPackedEncoder {
+public static partial class DeltaBinaryPackedEncoder {
should stay private
Reverted to private.
<#= nt#> num = data[0];
ulong firstValue = ZigZagEncode(num);
WriteUnsignedVarInt(destination, firstValue);
In case of long, will this lose precision?
Added a WriteZigZagVarLong function and refactored (removed the other functions).
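To make the precision question concrete, here is a minimal sketch of zigzag plus variable-length encoding done entirely in 64-bit arithmetic, which round-trips every `long` value including `long.MinValue`. The class and method names are illustrative assumptions and this is not the code from the PR; it only shows the general technique.

```csharp
using System;
using System.IO;

static class ZigZagVarLongSketch {
    // Zigzag maps signed to unsigned so that small magnitudes get small codes:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    public static ulong ZigZagEncode(long value) =>
        unchecked((ulong)((value << 1) ^ (value >> 63)));

    public static long ZigZagDecode(ulong value) =>
        (long)(value >> 1) ^ -(long)(value & 1);

    // Unsigned varint: 7 payload bits per byte, high bit set on every byte except the last.
    public static void WriteUnsignedVarLong(Stream destination, ulong value) {
        while (value >= 0x80) {
            destination.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        destination.WriteByte((byte)value);
    }

    public static ulong ReadUnsignedVarLong(Stream source) {
        ulong result = 0;
        int shift = 0;
        while (true) {
            int b = source.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            if (shift > 63) throw new InvalidDataException("varint is longer than 10 bytes");
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }
}
```

Precision would only be at risk if an intermediate step narrowed the value to 32 bits; keeping both the zigzag step and the varint loop on `ulong` avoids that.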
}
}

private static ulong ZigZagEncode(long num) {
should be moved to OtherExtensions.cs
Added GetZigZagEncoded extension method for long in OtherExtensions.cs
}
}
private static void WriteUnsignedVarInt(Stream destination, ulong value) {
    byte[] rentedBuffer = BytePool.Rent(8);
FYI renting small buffers is expensive, probably more than creating them directly.
Thank you for the information. I've updated the approach and am now creating small buffers directly.
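One possible way to avoid both renting and heap allocation for such a tiny buffer is a stack-allocated span. The sketch below is a swapped-in variant, not the change made in this PR; it assumes a target framework where `Stream.Write(ReadOnlySpan<byte>)` is available (.NET Core 2.1 or later, or .NET Standard 2.1), and the names are hypothetical. Note that a 64-bit varint can need up to 10 bytes, so the buffer is sized for 10 rather than 8.

```csharp
using System;
using System.IO;

static class SmallBufferSketch {
    // A 64-bit value needs at most ceil(64 / 7) = 10 varint bytes.
    private const int MaxVarLongBytes = 10;

    // Encodes into a stack buffer, then writes it to the stream in a single call.
    // Returns the number of bytes written.
    public static int WriteUnsignedVarLong(Stream destination, ulong value) {
        Span<byte> buffer = stackalloc byte[MaxVarLongBytes];
        int length = 0;
        while (value >= 0x80) {
            buffer[length++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
        }
        buffer[length++] = (byte)value;
        destination.Write(buffer.Slice(0, length));
        return length;
    }
}
```

On older targets without the span overload, a plain `new byte[10]` followed by `Write(buffer, 0, length)` would be the closest equivalent.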
* resolve pr comments
* Added ZigZagEncoded function to other extension class
* fixed typo
* Removed duplicate code
* used WriteUnsignedVarLong inside WriteUnsignedVarInt
* removed BytePool for smaller buffer
Apologies this is so slow. During the review I realised that other parts of this lib have a lot of duplication around variable-length and zigzag encoding methods, so the easiest way for me to deal with that was to copy this code into another branch and continue working on it there. I've copied most of the code already. The other thing I'm not sure about is whether the encoding needs to be configurable. For instance,
Closing this in favor of merging into #385. Thanks again!
Thank you @aloneguid for picking this up. Our current use case is fully compatible with the default delta binary packed encoding for integers.
Add DELTA_BINARY_PACKED encoder For Int32 and Int64 #375