Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DELTA_BINARY_PACKED encoder For Int32 and Int64 #382

Closed
wants to merge 3 commits into from

Conversation

ee-naveen
Copy link
Contributor

@ee-naveen ee-naveen commented Aug 8, 2023

Add DELTA_BINARY_PACKED encoder For Int32 and Int64 #375

  • Added Column Encoding in ParquetOptions class
  • Added DeltaBinaryPackedEncoder.Encode.Variations.tt file for encoding variation

Add DELTA_BINARY_PACKED encoder For Int32 and Int64
@eeabhishekmore
Copy link

@aloneguid please take a look when you can, thanks

@aloneguid
Copy link
Owner

Will do. I'll try to include this in the upcoming release.

@aloneguid aloneguid added this to the 4.16.0 milestone Aug 9, 2023
@aloneguid aloneguid self-requested a review August 14, 2023 14:22
/// The dictionary maps column names (as strings) to their corresponding encodings.
/// Each column can be encoded using a specific encoding method defined by the 'Encoding' enum.
/// </remarks>
public Dictionary<string, Encoding> ColumnEncoding { get; set; } = new Dictionary<string, Encoding>();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we use this?
Are only primitive column names at root supported as a key?
Are all Encodings respected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we use this?
Here is the sample.

...
var dataField = new DataField<long>("column1");
ParquetOptions options = new ParquetOptions() {
  ColumnEncoding = new Dictionary<string, Meta.Encoding>() {
      	{ "column1", Meta.Encoding.DELTA_BINARY_PACKED }
     }
};
...

Are all Encodings respected?

Currently only DELTA_BINARY_PACKED is supported for column encoding.
Here is the validation function

        private static bool EnsureColumnEncodingIsSupported(Encoding columnEncoding) {
            switch(columnEncoding) {
                case Encoding.DELTA_BINARY_PACKED:
                    return true;
                default:
                    throw new ArgumentException($"Not supported column encoding {columnEncoding}");
            }
        }

@@ -9,24 +12,120 @@ namespace Parquet.Encodings {
///
/// Supported Types: INT32, INT64
/// </summary>
static partial class DeltaBinaryPackedEncoder {
public static partial class DeltaBinaryPackedEncoder {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should stay private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to private.


<#= nt#> num = data[0];
ulong firstValue = ZigZagEncode(num);
WriteUnsignedVarInt(destination, firstValue);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of long, will this lose precision?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added WriteZigZagVarLong function and refactored (removed other functions)

}
}

private static ulong ZigZagEncode(long num) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be moved to OtherExtensions.cs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added GetZigZagEncoded extension method for long in OtherExtensions.cs

}
}
private static void WriteUnsignedVarInt(Stream destination, ulong value) {
byte[] rentedBuffer = BytePool.Rent(8);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI renting small buffers is expensive, probably more than creating them directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the information. I've updated the approach and now creating small buffers directly.

* resolve pr comments

* Added ZigZagEncoded function to other extention class

* fixed typo
* - Removed duplicate code.
- used WriteUnsignedVarLong inside WriteUnsignedVarInt
* removed bytedpool for smaller buffer
@aloneguid
Copy link
Owner

Apologies this is so slow. During the review I've realised the other parts of this lib have a lot of duplications around variable-length and zigzag encoding methods. So the easiest way to do that for me was to copy this code in another branch and continue working on it. I've copied most of the code already.

The other thing about configuration I'm not sure about is whether it needs to be configurable. For instance, delta-rs project just uses delta binary packed encoding by default for int32 and int64, so it might be also worth making your code the default for writing to make things simple for a user. Thoughts?

@aloneguid
Copy link
Owner

Closing this in favor of merging into #385. Thanks again!

@aloneguid aloneguid closed this Aug 17, 2023
@ee-naveen
Copy link
Contributor Author

Thank you @aloneguid for picking this up.

Our current use case is fully compatible with the default delta binary packed encoding for integers.
If necessary, we can consider adding support for configuration in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants