Unify the compression API in Velox #7471
Comments
cc: @FelixYBW |
@pedroerp Any plan to unify the codecs in Velox? It doesn't make sense to have 3 codec definitions in Velox today. In Gluten shuffle we currently use Arrow's; in spill we use Folly. Unifying them is the first step toward adding de-/compression accelerators to Velox. |
CC: @majetideepak |
CC: @xiaoxmeng |
@yaqi-zhao Based on the discussion in #7445, since the Arrow codec currently doesn't support async compression, I would suggest we unify the compression API first and then extend it with interfaces for async. |
I have tried to add an interface for async decompression (class AsyncDecompressor) in PR #6176. What worries me now is how to design a single interface for different hardware, since the way the async status is stored may vary from one device to another. Another concern is that upper-level callers may need massive changes to adopt the async interface. |
Could you initiate a design proposal to facilitate the review? Thanks! @yaqi-zhao , @marin-ma |
@george-gu-2021 I only have an initial idea of how to design the interface for IAA; you can refer to "class AsyncDecompressor" in PR #6176.
Given the concern I raised above, I don't know how to design a universal API for all hardware. Maybe @marin-ma and @FelixYBW already have some insights. |
@yaqi-zhao It took me some time to review #6176. Here are some initial thoughts:
|
@george-gu-2021 @yaqi-zhao The detailed implementation of the async compression API is somewhat beyond the scope of this discussion. I would suggest we unify the compression module as the first step. We can then add different compression codecs to the common module, which makes the IAA codec available for shuffle, spill, etc. Besides, we can offer not only the async IAA codec but also a synchronous one. |
Do you have any insights on the proposal of replacing dwrf/folly compression with Arrow? The Arrow compression API is already mature and can satisfy the requirements of other modules, so I'm planning to work on a PR that replaces the codecs under velox/common/compression with the Arrow codec implementation, but there are two paths we can choose:
I would appreciate your suggestions and comments. Thanks! |
That would be nice. I didn't realize we have 3 implementations. |
@marin-ma I'm a bit concerned about using Arrow's implementation directly. At a minimum we'd need to rewrite it to match the coding style and design patterns of Velox and replace its status type with a Velox-specific one. Can we maybe start by replacing dwrf/folly compression with velox/common/compression? Are there any differences between these 2 implementations? |
Thank you for your advice!
Yes. The supported codecs are different, which have been listed in the issue description. And dwrf supports configuring a few codec-specific parameters, while folly doesn't. |
Got it. What are the "few codec-specific parameters"? If we need these, then maybe you can start looking at replacing these 2 implementations with a single unified one. |
velox/velox/dwio/common/compression/Compression.h Lines 68 to 90 in 9cfa1a6
|
@marin-ma thanks for putting together the proposal. A suggested sequence of steps could be:
|
@pedroerp What do you think about defining an async (de)compressor like this?
|
I would think returning
And the implementation for IAACompressor:
For caller, in the stage of pre-decompressing the page:
For caller, getting decompressed result:
|
I plan to submit a series of PRs for the implementation of compression codecs API. The submission sequence may change based on practical considerations. The proposed order is:
|
Late to the discussion, but +1. CC: @karteekmurthys |
I've created some PRs to add the unified API, which are now ready for review. The initial patch #7589 has been created for the first goal. |
@marin-ma @FelixYBW Hi, sorry I didn't see this in time. Just some background: the Arrow codec was only used in the Parquet writer (in the datasink, not the datasource), which we recently moved into Velox (Velox used to call into the Arrow Parquet writer). The Parquet reader (in the datasource) is an in-house one that only uses the Velox codec, and with #5914 and #6105, it is now using the common
Thank you. @marin-ma will update the PR. |
There are 3 PRs planned, as shown in the comments of #8084 and copied below. #8084 is the first one; we are waiting for the second one.
|
@yaqi-zhao According to #8084 (comment) the |
@marin-ma Got it, thanks! |
@yaqi-zhao is that blocking this work? That fell off of my radar, but I can take a stab at it soon if this is needed here. |
I see. I think my question is whether this issue depends on the "Result" implementation design. |
I'm not sure. According to @marin-ma's previous comments, this issue is pending on the "Result" implementation design. |
@marin-ma Yes, we should use Status and Expected to avoid throwing exceptions in hot paths. @mbasmanova has recently added a few examples of how that can be used. |
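For readers unfamiliar with the pattern, here is a rough sketch of what returning a status or a value instead of throwing can look like. Everything in it is an assumption made for illustration: `Status` and `Expected` below are simplified stand-ins, not Velox's Status class or folly::Expected, and `decompressRle` is a made-up toy run-length decoder rather than any codec from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-in for a status type.
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Invalid(std::string message) {
    return Status(std::move(message));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  Status() : ok_(true) {}
  explicit Status(std::string message)
      : ok_(false), message_(std::move(message)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical stand-in for an "expected" type: holds either a value of
// type T or a non-OK Status.
template <typename T>
class Expected {
 public:
  Expected(T value) : data_(std::move(value)) {}
  Expected(Status error) : data_(std::move(error)) {
    assert(!this->error().ok()); // an error must carry a non-OK status
  }
  bool hasValue() const { return std::holds_alternative<T>(data_); }
  const T& value() const { return std::get<T>(data_); }
  const Status& error() const { return std::get<Status>(data_); }

 private:
  std::variant<T, Status> data_;
};

// Toy decoder for (count, byte) pairs. Bad input is reported through the
// return value, so the caller's hot loop never unwinds the stack.
Expected<size_t> decompressRle(
    const std::vector<uint8_t>& input,
    std::vector<uint8_t>& output,
    size_t maxOutputSize) {
  if (input.size() % 2 != 0) {
    return Status::Invalid("truncated input");
  }
  output.clear();
  size_t written = 0;
  for (size_t i = 0; i < input.size(); i += 2) {
    const size_t count = input[i];
    if (written + count > maxOutputSize) {
      return Status::Invalid("output buffer too small");
    }
    output.insert(output.end(), count, input[i + 1]);
    written += count;
  }
  return written;
}
```

A caller checks `hasValue()` and branches, which keeps the error path as cheap as the success path; whether the real API should look like this is exactly what the "Result" design discussion has to settle.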
Description
Context
Within the current Velox implementation, there are three distinct modules utilizing different compression codecs and methods:
Parquet datasource - Uses Arrow codec
DWRF datasource - Uses a self-defined streaming compressor/decompressor
Spill - Uses Folly Codec
Proposal
It would be better to unify the compression strategy across these modules. Given that the Arrow compression API has already been integrated into Velox along with the parquet writer, it would be more efficient to embrace the Arrow compression API instead of creating a new Codec API from scratch:
It offers a Codec interface for batch data compression and a De/Compressor interface for streaming data compression. It also supports configuring the parameters of different compression algorithms with CodecOptions.
With arrow::Status/Result for return codes, the API is general and easy to extend. We have already implemented compression codecs with QAT/IAA accelerators and used them successfully in our project Gluten. With the unified Arrow compression API, we can directly contribute these enhancements to Velox.
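To make the three roles in the proposal concrete, here is a minimal sketch of a batch codec, a streaming compressor, and an options struct. All names (`CodecOptions::maxRunLength`, `StreamingCompressor`, `RleCodec`, and so on) are hypothetical and chosen for this example; they are not the Arrow or Velox signatures. The toy run-length encoding stands in for a real algorithm, and error handling is reduced to assertions to keep the sketch short, whereas the proposal would return arrow::Status/Result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical options struct; a real one would carry codec-specific
// parameters such as a compression level.
struct CodecOptions {
  uint8_t maxRunLength = 255;
};

// Streaming role: input arrives in chunks, output is appended.
class StreamingCompressor {
 public:
  virtual ~StreamingCompressor() = default;
  virtual void compress(
      const uint8_t* input, size_t length, std::vector<uint8_t>& output) = 0;
  virtual void end(std::vector<uint8_t>& output) = 0;
};

// Batch role: whole buffers in, whole buffers out.
class Codec {
 public:
  virtual ~Codec() = default;
  virtual size_t maxCompressedLength(size_t inputLength) const = 0;
  virtual size_t compress(
      const uint8_t* input, size_t length, uint8_t* output, size_t capacity) = 0;
  virtual size_t decompress(
      const uint8_t* input, size_t length, uint8_t* output, size_t capacity) = 0;
  virtual std::unique_ptr<StreamingCompressor> makeCompressor() = 0;
};

// Toy run-length encoder emitting (count, byte) pairs; the current run is
// kept across chunks so chunk boundaries do not change the output.
class RleStreamingCompressor : public StreamingCompressor {
 public:
  explicit RleStreamingCompressor(uint8_t maxRun) : maxRun_(maxRun) {}
  void compress(
      const uint8_t* input, size_t length, std::vector<uint8_t>& output) override {
    for (size_t i = 0; i < length; ++i) {
      if (runLength_ > 0 && input[i] == runByte_ && runLength_ < maxRun_) {
        ++runLength_;
      } else {
        flush(output);
        runByte_ = input[i];
        runLength_ = 1;
      }
    }
  }
  void end(std::vector<uint8_t>& output) override { flush(output); }

 private:
  void flush(std::vector<uint8_t>& output) {
    if (runLength_ == 0) return;
    output.push_back(runLength_);
    output.push_back(runByte_);
    runLength_ = 0;
  }
  uint8_t maxRun_;
  uint8_t runByte_ = 0;
  uint8_t runLength_ = 0;
};

class RleCodec : public Codec {
 public:
  explicit RleCodec(CodecOptions options = CodecOptions()) : options_(options) {}
  size_t maxCompressedLength(size_t inputLength) const override {
    return 2 * inputLength; // worst case: every byte is its own run
  }
  size_t compress(
      const uint8_t* input, size_t length, uint8_t* output, size_t capacity) override {
    // The batch path reuses the streaming compressor with a single chunk.
    std::vector<uint8_t> buffer;
    auto compressor = makeCompressor();
    compressor->compress(input, length, buffer);
    compressor->end(buffer);
    assert(buffer.size() <= capacity);
    std::copy(buffer.begin(), buffer.end(), output);
    return buffer.size();
  }
  size_t decompress(
      const uint8_t* input, size_t length, uint8_t* output, size_t capacity) override {
    assert(length % 2 == 0);
    size_t written = 0;
    for (size_t i = 0; i < length; i += 2) {
      assert(written + input[i] <= capacity);
      std::fill_n(output + written, input[i], input[i + 1]);
      written += input[i];
    }
    return written;
  }
  std::unique_ptr<StreamingCompressor> makeCompressor() override {
    return std::make_unique<RleStreamingCompressor>(options_.maxRunLength);
  }

 private:
  CodecOptions options_;
};
```

With a shape like this, shuffle and spill could call the batch methods while a file writer feeds chunks to the streaming compressor, and an accelerator-backed codec would only need to implement the same two interfaces.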