add packed struct encoding #2601

raunaks13 · 2024-07-15T15:49:20Z

Currently, random access into a struct array takes multiple IOPS, which is suboptimal.

For example, take a struct array with fields x: [1, 2, 3], y: [2.4, 5.6, 3.8], z: ['a', 'b', 'c'], where x is of type UInt64, y of type Float32, and z of type UInt8.

In the packed encoding, we would encode the fields of each struct value next to each other as a "packed struct": [1, 2.4, 'a', 2, 5.6, 'b', 3, 3.8, 'c']
This is better for random access, since fetching a row only takes 1 IOP. Of course, this means the fields have to be decoded together. But often this is fine (especially in cases where we need to access multiple fields of the struct anyway).

This would involve encoding the individual child arrays, and then packing the encoded data together afterwards.
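A rough sketch of the idea in Python, using only the standard library. The helper names (`encode_child`, `pack_struct`, `read_row`) and the little-endian, unpadded row layout are assumptions made for illustration and are not taken from the Lance codebase. Each child array is first encoded to a flat fixed-width buffer, then the buffers are interleaved row by row, so one row can be fetched with a single contiguous read.

```python
import struct

# Assumed layout for the example struct: x: UInt64, y: Float32, z: UInt8.
# "<" means little-endian with no padding, so a row is 8 + 4 + 1 = 13 bytes.
ROW = struct.Struct("<QfB")


def encode_child(fmt, values):
    """Encode one child array as a flat fixed-width buffer."""
    return b"".join(struct.pack(fmt, v) for v in values)


def pack_struct(children, widths):
    """Interleave already-encoded child buffers row by row."""
    num_rows = len(children[0]) // widths[0]
    out = bytearray()
    for row in range(num_rows):
        for buf, width in zip(children, widths):
            out += buf[row * width:(row + 1) * width]
    return bytes(out)


def read_row(packed, row):
    """Random access: one contiguous slice (a single read) per row."""
    start = row * ROW.size
    return ROW.unpack(packed[start:start + ROW.size])


xs = [1, 2, 3]
ys = [2.4, 5.6, 3.8]
zs = [ord(c) for c in "abc"]  # 'a', 'b', 'c' stored as UInt8

children = [encode_child("<Q", xs), encode_child("<f", ys), encode_child("<B", zs)]
packed = pack_struct(children, [8, 4, 1])
x, y, z = read_row(packed, 1)  # second row: x=2, y≈5.6, z=ord('b')
```

The tradeoff is visible in `read_row`: all three fields come back together, even if the caller only wanted one of them.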

Tasks:

- Packed struct encoding + packing
  - This should be an array encoder (should fall under the primitive field encoding path)
- Packed struct unpacking + decode
- Select which encoder to use (default struct encoder vs packed struct encoder) based on a flag in the metadata

Introduces a new `PackedStruct` encoding, which should speed up random access for struct data, ref #2601

- Can currently support non-nullable, primitive fixed-length types (including fixed size list)
- Implemented as a physical type array encoder
- The user can select whether they want to use this encoding by specifying the field `"packed"` as `true` or `false` in the metadata. The default will use the old `StructFieldEncoder`
- Python benchmarks for reading/writing a table in case of both (i) full scans and (ii) random access are added in `test_packed_struct.py`. The expectation is that this encoding will perform better for random access, and worse in case of full scans.

Benchmarking results (10M rows, 5 struct fields, retrieving 100 rows via random access):

Read perf: <img width="1401" alt="Screenshot 2024-07-24 at 8 05 08 AM" src="https://github.com/user-attachments/assets/dcaddcfc-a1a6-4ba3-b5f0-292d57b051b0">

Write perf: <img width="1401" alt="Screenshot 2024-07-24 at 8 11 23 AM" src="https://github.com/user-attachments/assets/f9f0c972-ce40-4a3e-a50e-5e88294d2ba3">

To reproduce, run `pytest python/benchmarks/test_packed_struct.py -k <group>`, where `group` can be `"read"` or `"write"`.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
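The selection rule described above could look roughly like the following sketch. Only the `"packed"` metadata key and the two encoder names come from the description; the function name, the type-name strings, and the choice to raise an error for unsupported fields are assumptions for illustration (the actual implementation may behave differently, and fixed size lists are left out here for brevity).

```python
# Hypothetical set of type names treated as primitive fixed-width.
FIXED_WIDTH_TYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float32", "float64",
}


def choose_struct_encoder(field_metadata, child_types, nullable):
    """Pick an encoder name from the field metadata (string keys and values)."""
    wants_packed = field_metadata.get("packed", "false").lower() == "true"
    if not wants_packed:
        # Default path: the existing struct encoder.
        return "StructFieldEncoder"
    # Assumed behavior: reject fields the packed encoding cannot handle yet.
    if nullable or any(t not in FIXED_WIDTH_TYPES for t in child_types):
        raise ValueError("packed encoding needs non-nullable fixed-width children")
    return "PackedStruct"


default_choice = choose_struct_encoder({}, ["uint64", "float32"], nullable=False)
packed_choice = choose_struct_encoder({"packed": "true"}, ["uint64", "float32"], nullable=False)
```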

This PR tries to add packed struct encoding. During encoding, it packs a struct with fixed-width fields into a row-oriented `FixedWidthDataBlock`, then uses `ValueCompressor` to compress it into a `MiniBlock Layout`. During decoding, it first uses `ValueDecompressor` to recover the row-oriented `FixedWidthDataBlock`, then constructs a `StructDataBlock` for output. #3173 #2601
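The last decode step (row-oriented block back to per-field columns) can be sketched in plain Python. This is only an illustration of the data movement and does not use Lance's `FixedWidthDataBlock` or `StructDataBlock` types; the row layout and the `unpack_to_columns` name are assumptions, and compression is skipped entirely.

```python
import struct

# Assumed row layout: UInt64, Float32, UInt8, little-endian, no padding.
ROW = struct.Struct("<QfB")


def unpack_to_columns(block):
    """Split a row-oriented fixed-width buffer into one list per field."""
    rows = list(ROW.iter_unpack(block))
    # Transpose rows -> columns.
    return [list(col) for col in zip(*rows)]


# Build a small row-oriented block to decode.
block = b"".join(
    ROW.pack(x, y, z) for x, y, z in [(1, 2.4, 97), (2, 5.6, 98), (3, 3.8, 99)]
)
xs, ys, zs = unpack_to_columns(block)
```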

raunaks13 self-assigned this Jul 15, 2024

raunaks13 mentioned this issue Jul 15, 2024

feat: add a packed struct encoding to lance #2593

Merged

raunaks13 closed this as completed Jul 24, 2024

This was referenced Sep 11, 2024

Support "packing" columns to allow faster retrieval of groups of many columns #1457

Closed

Allow packing of variable length columns #2862

Open

broccoliSpicy mentioned this issue Nov 29, 2024

feat: packed struct encoding #3186

Merged
