add packed struct encoding #2601
Comments
raunaks13 added a commit that referenced this issue on Jul 24, 2024
Introduces a new `PackedStruct` encoding, which should speed up random access for struct data, ref #2601

- Currently supports non-nullable, primitive fixed-length types (including fixed size list)
- Implemented as a physical type array encoder
- The user can select whether to use this encoding by setting the field `"packed"` to `true` or `false` in the metadata. The default uses the old `StructFieldEncoder`
- Python benchmarks for reading/writing a table for both (i) full scans and (ii) random access are added in `test_packed_struct.py`. The expectation is that this encoding will perform better for random access and worse for full scans.

Benchmarking results (10M rows, 5 struct fields, retrieving 100 rows via random access):

Read perf:
<img width="1401" alt="Screenshot 2024-07-24 at 8 05 08 AM" src="https://github.com/user-attachments/assets/dcaddcfc-a1a6-4ba3-b5f0-292d57b051b0">

Write perf:
<img width="1401" alt="Screenshot 2024-07-24 at 8 11 23 AM" src="https://github.com/user-attachments/assets/f9f0c972-ce40-4a3e-a50e-5e88294d2ba3">

To reproduce, run `pytest python/benchmarks/test_packed_struct.py -k <group>`, where `group` can be `"read"` or `"write"`

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
This was referenced Sep 11, 2024
broccoliSpicy added a commit that referenced this issue on Dec 10, 2024
This PR tries to add packed struct encoding. During encoding, it packs a struct with fixed-width fields, producing a row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to compress it into a `MiniBlock Layout`. During decoding, it first uses `ValueDecompressor` to get the row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock` for output. #3173 #2601
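The pack and unpack steps described above can be illustrated with a minimal sketch using only Python's standard `struct` module. This is not the Lance implementation: `pack_columns`, `unpack_to_columns`, and the `<QfB` row layout (uint64, float32, uint8, little-endian, no padding) are hypothetical names and assumptions chosen for illustration, and the compression step is omitted.

```python
import struct

# Assumed row layout for illustration: uint64, float32, uint8,
# little-endian with no padding (13 bytes per row).
ROW = struct.Struct("<QfB")

def pack_columns(xs, ys, zs):
    """Interleave equal-length columns into one row-oriented byte buffer."""
    return b"".join(ROW.pack(x, y, z) for x, y, z in zip(xs, ys, zs))

def unpack_to_columns(buf):
    """Split a non-empty row-oriented buffer back into one list per field."""
    rows = list(ROW.iter_unpack(buf))
    xs, ys, zs = (list(col) for col in zip(*rows))
    return xs, ys, zs

# Encode: three columns become a single fixed-width, row-oriented buffer.
packed = pack_columns([1, 2, 3], [2.4, 5.6, 3.8], [ord(c) for c in "abc"])

# Decode: the buffer is split back into per-field columns.
xs, ys, zs = unpack_to_columns(packed)
```

Because every field has a fixed width, each row occupies the same number of bytes, which is what makes the row-oriented buffer addressable by row index.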
Currently, struct arrays are encoded in a way that requires multiple IOPS for random access, which is suboptimal.
E.g. if we have a struct array `x: {1, 2, 3}, y: {2.4, 5.6, 3.8}, z: ['a', 'b', 'c']`, with `x` of type `uint64`, `y` of type `Float32`, and `z` of type `UInt8`, then in the packed encoding we would encode fields of the struct close together as a "packed struct":
`[1, 2.4, 'a', 2, 5.6, 'b', 3, 3.8, 'c']`
This is better for random access, since decoding a row will only use 1 IOP. Of course, this means the fields have to be decoded together, but this is often fine (especially in cases where we need to access multiple fields of the struct anyway).
This would involve encoding the individual child arrays, and then packing the encoded data together afterwards.
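To make the IOP argument concrete, here is a small sketch that counts reads for one row under both layouts. It assumes an in-memory stand-in for storage (`CountingStore`, a hypothetical class) where each `read` call models one IOP, and it assumes each field is stored in its own contiguous region in the unpacked case. Real storage behavior (coalesced reads, caching) will differ.

```python
import struct

class CountingStore:
    """Hypothetical in-memory storage that counts read calls (one per IOP)."""
    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, offset, length):
        self.reads += 1
        return self.data[offset:offset + length]

xs, ys, zs = [1, 2, 3], [2.4, 5.6, 3.8], [ord(c) for c in "abc"]
n = len(xs)

# Unpacked layout: all x values, then all y values, then all z values.
columnar = CountingStore(
    struct.pack("<3Q", *xs) + struct.pack("<3f", *ys) + struct.pack("<3B", *zs)
)

def read_row_columnar(store, i):
    # One read per field, each at a different offset.
    (x,) = struct.unpack("<Q", store.read(8 * i, 8))
    (y,) = struct.unpack("<f", store.read(8 * n + 4 * i, 4))
    (z,) = struct.unpack("<B", store.read(12 * n + i, 1))
    return x, y, z

# Packed layout: the fields of each row are stored next to each other.
ROW = struct.Struct("<QfB")
packed = CountingStore(b"".join(ROW.pack(*r) for r in zip(xs, ys, zs)))

def read_row_packed(store, i):
    # A single contiguous read covers every field of row i.
    return ROW.unpack(store.read(ROW.size * i, ROW.size))

row_c = read_row_columnar(columnar, 1)
row_p = read_row_packed(packed, 1)
```

After fetching one row from each store, `columnar.reads` is 3 and `packed.reads` is 1, while both calls return the same field values. The trade-off noted above also shows up here: the packed read always fetches all fields, even when only one is needed.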
Tasks: