add packed struct encoding #2601

Closed
4 tasks done
raunaks13 opened this issue Jul 15, 2024 · 0 comments

raunaks13 commented Jul 15, 2024

Currently, random access into a struct array takes multiple IOPS because of how struct arrays are encoded, which is suboptimal.

For example, take a struct array with fields x: {1, 2, 3}, y: {2.4, 5.6, 3.8}, z: ['a', 'b', 'c'], where x has type UInt64, y has type Float32, and z has type UInt8.

In the packed encoding, the fields of each struct are stored next to each other as a "packed struct": [1, 2.4, 'a', 2, 5.6, 'b', 3, 3.8, 'c']
This is better for random access because decoding a row needs only 1 IOP. The trade-off is that the fields have to be decoded together, but this is often fine, especially when multiple fields of the struct are needed anyway.

This would involve encoding the individual child arrays, and then packing the encoded data together afterwards.
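As a rough illustration of the layout described above, here is a minimal sketch using Python's standard `struct` module. It is not Lance code: the little-endian, unpadded row format and the helper names `pack_rows` and `read_row` are assumptions made for this example only.

```python
import struct

# Assumed row layout for the example: little-endian UInt64, Float32, UInt8,
# with no padding between fields ("<" disables alignment in the struct module).
ROW = struct.Struct("<QfB")  # 8 + 4 + 1 = 13 bytes per row


def pack_rows(xs, ys, zs):
    """Interleave three columns into one row-oriented buffer."""
    return b"".join(ROW.pack(x, y, z) for x, y, z in zip(xs, ys, zs))


def read_row(buf, i):
    """Random access: one contiguous slice holds every field of row i."""
    return ROW.unpack_from(buf, i * ROW.size)


packed = pack_rows([1, 2, 3], [2.4, 5.6, 3.8], [ord(c) for c in "abc"])
x, y, z = read_row(packed, 1)  # fields of the second struct
```

Because every row has the same width, the byte offset of row `i` is just `i * ROW.size`, so fetching a row is a single contiguous read instead of one read per field.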

Tasks:

  • Packed struct encoding + packing
    • This should be an array encoder (should fall under the primitive field encoding path)
  • Packed struct unpacking + decode
  • Select which encoder to use (default struct encoder vs packed struct encoder) based on flag in metadata
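
The last task, choosing an encoder from a metadata flag, could look roughly like the sketch below. The function name `select_struct_encoder`, the returned labels, and the string-valued `"packed"` key are hypothetical and chosen for illustration; they do not describe the actual Lance API.

```python
def select_struct_encoder(field_metadata):
    """Pick an encoder label from a field's metadata (hypothetical helper).

    `field_metadata` is assumed to be a str -> str mapping, or None.
    Any value other than "true" (case-insensitive) selects the default encoder.
    """
    flag = (field_metadata or {}).get("packed", "false")
    return "packed" if str(flag).lower() == "true" else "default"


choice = select_struct_encoder({"packed": "true"})
fallback = select_struct_encoder({})
```

Defaulting to the existing struct encoder when the flag is absent keeps fields that never opted in on the old path.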
raunaks13 self-assigned this Jul 15, 2024
raunaks13 added a commit that referenced this issue Jul 24, 2024
Introduces a new `PackedStruct` encoding that should speed up random
access for struct data (ref #2601).
- Currently supports non-nullable, primitive fixed-length types
(including fixed size list)
- Implemented as a physical type array encoder
- The user can select this encoding by setting the field `"packed"` to
`true` or `false` in the metadata. The default is the old
`StructFieldEncoder`
- Python benchmarks for reading and writing a table, covering both
(i) full scans and (ii) random access, are added in
`test_packed_struct.py`. The expectation is that this encoding will
perform better for random access and worse for full scans.

Benchmarking results (10M rows, 5 struct fields, retrieving 100 rows
via random access):
Read perf: screenshot at https://github.com/user-attachments/assets/dcaddcfc-a1a6-4ba3-b5f0-292d57b051b0
Write perf: screenshot at https://github.com/user-attachments/assets/f9f0c972-ce40-4a3e-a50e-5e88294d2ba3
To reproduce, run `pytest python/benchmarks/test_packed_struct.py -k <group>`, where `<group>` can be `"read"` or `"write"`.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
broccoliSpicy added a commit that referenced this issue Dec 10, 2024
This PR tries to add packed struct encoding.

During encoding, it packs a struct with fixed-width fields, producing a
row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to
compress it into a `MiniBlock Layout`.

During decoding, it first uses `ValueDecompressor` to get the
row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock`
for output.

#3173 #2601
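
The pack and unpack steps described in this commit can be sketched on raw byte buffers as follows. This is only an illustration under stated assumptions: plain `bytes` objects stand in for the per-field buffers and for the row-oriented block, the compression step is left out, and the function names `pack` and `unpack` are hypothetical rather than taken from the Lance code base.

```python
import struct


def pack(children, widths):
    """Interleave per-field buffers into one row-oriented buffer.

    children[k] is assumed to hold num_rows * widths[k] bytes for field k.
    """
    num_rows = len(children[0]) // widths[0]
    out = bytearray()
    for row in range(num_rows):
        for buf, width in zip(children, widths):
            out += buf[row * width:(row + 1) * width]
    return bytes(out)


def unpack(packed, widths):
    """Split a row-oriented buffer back into one buffer per field."""
    row_size = sum(widths)
    num_rows = len(packed) // row_size
    children = [bytearray() for _ in widths]
    for row in range(num_rows):
        offset = row * row_size
        for child, width in zip(children, widths):
            child.extend(packed[offset:offset + width])
            offset += width
    return [bytes(child) for child in children]


xs = struct.pack("<3Q", 1, 2, 3)  # three UInt64 values, 8 bytes each
zs = b"abc"                       # three UInt8 values, 1 byte each
rows = pack([xs, zs], [8, 1])
restored = unpack(rows, [8, 1])
```

Working on byte widths rather than typed values means the same routine applies to any mix of fixed-width fields, which is why the sketch only needs the per-field widths.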