Vortex Layouts #1676

gatesn · 2024-12-13T17:11:59Z

[edit] I've hijacked the "Refactor VortexFileWriter" issue to cover the broader work towards vortex layouts.

The description here is WIP

Vortex Layouts

Vortex Layouts should be seen as symmetrical to Vortex Arrays, except that the data is stored externally and they can be lazily queried with push-down pruning. They are fully independent of the storage system, requiring only a key(u32)/value(bytes) API.

A secondary benefit of separating the layout logic from storage, is that we remove all I/O from this code path allowing us to have much better control over when, where and how we interact with I/O (as well as removing all traces of async Rust from the layout logic).

Built-in Layouts

Our initial prototype ships with three layouts:

Flat - an atomic array (of any DType). The serialized form does not include the dtype (to prevent duplication).
Chunked - row-based partitioning of an array.
Struct - col-based partitioning of a struct array.

Additional layouts for the future:

Dictionary - pull out a shared dictionary across another child layout, e.g. store values as flat layout, store codes in a chunked layout to use one dictionary across all chunks of a column.
- See Support shared dictionaries across chunked arrays #252
FlatStruct - same as Struct, except with flattened nested fields.

As with everything in Vortex, these will be extensible by consumers of the Rust library.

Layout Strategies

The power of layouts is in how we choose them. I think by default Vortex will ship with a StructOfChunks layout strategy that first splits into columns, then each column is independently chunked based on a target byte size (rather than row-count). This can be pretty useful when storing columns with large values. It means the returned batches are sized based on the data that's scanned, and doesn't suffer the write-time skew that e.g. Parquet might introduce with its "ChunksOfStruct" (row-groups of columns) strategy.

A LayoutStrategy is fed a series of ArrayData batches that represent a row-wise chunk of an array. Each strategy may manipulate, buffer, split, this batch however it likes and pass it to some child strategies, eventually culminating in a FlatStrategy that simply serializes the data. At the end of this process, the layout strategy returns the LayoutData (remember the analogy to ArrayData?).

Layout Scan

Scanning a layout is sort of analogous to an array compute function. It's probably the only "compute function" we support on layouts for a while (maybe we support sum/min/max etc?).

From a LayoutData, we construct a LayoutScan, and from this we construct a RangeScan per horizontal "split" of the file that we wish to read. This two-level approach is so for a single scan operator layouts can share some state across ranges, e.g. DictLayout would want to share the canonicalized values array.

The old issue:

Tracking a bunch of breaking changes to the format and refactoring of the writer

VortexFileWriter to pass chunks into an abstract LayoutWriter.
- We ship a default LayoutWriter that is struct-of-chunked.
Impl LayoutWriter for struct, chunked, flat, and later dict layouts.
Layout flat buffer to hold BlockIds instead of offset/length. This allows Layout to be abstract over the storage format and not assume linear bytes (e.g. can store in an arbitrary block device).
Postscript to point to DType + FileLayout and not assume contiguous bytes.
Ensure we don't write extra padding / framing, e.g. we sometimes use IPCMessage and MessageWriter for writing to the file, even though this adds framing for a streaming format.
Allow configurable padding in the file to support mem-mapping with zero-copy.
- See Clean up alignment #115
Allow configurable compression for buffers.

The text was updated successfully, but these errors were encountered:

Part of #1676

Initial implementation of the new structure of vortex layouts per #1676 * Only flat layout works. * I'm not 100% sure on the trait APIs, these will evolve as we pad out the implementation. * StructLayout will be worked on as part of #1782 so will probably come last. Up next: * Implementation of ChunkedLayout Open Questions: * What is the API that e.g. Python users have to precisely configure layout strategies? Can I override a layout writer for a specific field? * Similarly, how can we configure the layout scanners? Can I configure a level 0 chunked layout differently from level 2 in a chunk-of-struct-of-chunk world?

Introduces the idea of a `driver` as a pluggable way of orchestrating I/O and CPU based work. An `ExecDriver` takes a `Stream<Future<T>>` and returns a `Stream<T>`. The abstraction means we can implement drivers that use the current runtime thread (Inline), spawn the work onto an explicit Runtime, or spawn the work onto a thread pool. The `IoDriver` takes a stream of segment requests and has to resolve them. For now, we have a FileIoDriver that assumes the bytes are laid out on disk (and therefore coalescing is desirable), so it pulls requests off the queue, coalesces them, then launches concurrent I/O requests to resolve them. The VortexFile remains generic over the I/O driver, meaning the send-ness of the resulting ArrayStream is inferred based on the send-ness of the I/O driver (in the case of FileDriver, that is the send-ness of the VortexReadAt). Part of #1676

Part of #1676

gatesn mentioned this issue Dec 14, 2024

IPC Message clean up #1686

Merged

gatesn added a commit that referenced this issue Dec 16, 2024

IPC Message clean up (#1686)

4254d3c

Part of #1676

gatesn mentioned this issue Dec 16, 2024

Message Codec #1692

Merged

lwwmanning mentioned this issue Dec 17, 2024

[DNM] benchmarks against object storage #1472

Closed

This was referenced Dec 29, 2024

some issues about reading/writing vortex file #1749

Closed

Clean up alignment #115

Closed

philippemnoel mentioned this issue Jan 1, 2025

Swap Tantivy fast fields for Vortex / Our own format? paradedb/paradedb#1916

Open

gatesn added wire-break Includes a break to the serialized IPC or file format epic labels Jan 2, 2025

gatesn self-assigned this Jan 2, 2025

gatesn changed the title ~~Refactor VortexFileWriter~~ Vortex Layouts Jan 2, 2025

gatesn mentioned this issue Jan 3, 2025

Initial Vortex Layouts #1805

Merged

This was referenced Jan 7, 2025

Vortex Layouts - Chunked #1814

Merged

Vortex Layouts - Chunked #1819

Merged

Vortex Layouts - chunk pruning #1824

Merged

Arc layout scan #1825

Merged

Vortex Layouts File V2 #1830

Merged

Vortex Layouts - Drivers #1914

Merged

gatesn mentioned this issue Jan 13, 2025

Vortex Layouts - Some Cleanup #1917

Merged

gatesn added a commit that referenced this issue Jan 13, 2025

Vortex Layouts - Some Cleanup (#1917)

86b6ab4

Part of #1676

gatesn mentioned this issue Jan 13, 2025

Support opening Vortex files without I/O #1920

Merged

gatesn added a commit that referenced this issue Jan 13, 2025

Support opening Vortex files without I/O (#1920)

99eb574

Part of #1676

shenlei149 mentioned this issue Jan 15, 2025

the diff between file size and array size is too large #1952

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vortex Layouts #1676

Vortex Layouts #1676

gatesn commented Dec 13, 2024 •

edited

Loading

Vortex Layouts #1676

Vortex Layouts #1676

Comments

gatesn commented Dec 13, 2024 • edited Loading

Vortex Layouts

Built-in Layouts

Layout Strategies

Layout Scan

gatesn commented Dec 13, 2024 •

edited

Loading