-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vortex Layouts #1676
Labels
Comments
Merged
Merged
This was referenced Dec 29, 2024
Closed
gatesn
added
wire-break
Includes a break to the serialized IPC or file format
epic
labels
Jan 2, 2025
Merged
gatesn
added a commit
that referenced
this issue
Jan 3, 2025
Initial implementation of the new structure of vortex layouts per #1676 * Only flat layout works. * I'm not 100% sure on the trait APIs, these will evolve as we pad out the implementation. * StructLayout will be worked on as part of #1782 so will probably come last. Up next: * Implementation of ChunkedLayout Open Questions: * What is the API that e.g. Python users have to precisely configure layout strategies? Can I override a layout writer for a specific field? * Similarly, how can we configure the layout scanners? Can I configure a level 0 chunked layout differently from level 2 in a chunk-of-struct-of-chunk world?
This was referenced Jan 7, 2025
Merged
Merged
gatesn
added a commit
that referenced
this issue
Jan 13, 2025
Introduces the idea of a `driver` as a pluggable way of orchestrating I/O and CPU based work. An `ExecDriver` takes a `Stream<Future<T>>` and returns a `Stream<T>`. The abstraction means we can implement drivers that use the current runtime thread (Inline), spawn the work onto an explicit Runtime, or spawn the work onto a thread pool. The `IoDriver` takes a stream of segment requests and has to resolve them. For now, we have a FileIoDriver that assumes the bytes are laid out on disk (and therefore coalescing is desirable), so it pulls requests off the queue, coalesces them, then launches concurrent I/O requests to resolve them. The VortexFile remains generic over the I/O driver, meaning the send-ness of the resulting ArrayStream is inferred based on the send-ness of the I/O driver (in the case of FileDriver, that is the send-ness of the VortexReadAt). Part of #1676
gatesn
added a commit
that referenced
this issue
Jan 13, 2025
gatesn
added a commit
that referenced
this issue
Jan 13, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
[edit] I've hijacked the "Refactor VortexFileWriter" issue to cover the broader work towards vortex layouts.
The description here is WIP
Vortex Layouts
Vortex Layouts should be seen as symmetrical to Vortex Arrays, except that the data is stored externally and they can be lazily queried with push-down pruning. They are fully independent of the storage system, requiring only a key(u32)/value(bytes) API.
A secondary benefit of separating the layout logic from storage, is that we remove all I/O from this code path allowing us to have much better control over when, where and how we interact with I/O (as well as removing all traces of async Rust from the layout logic).
Built-in Layouts
Our initial prototype ships with three layouts:
Additional layouts for the future:
As with everything in Vortex, these will be extensible by consumers of the Rust library.
Layout Strategies
The power of layouts is in how we choose them. I think by default Vortex will ship with a
StructOfChunks
layout strategy that first splits into columns, then each column is independently chunked based on a target byte size (rather than row-count). This can be pretty useful when storing columns with large values. It means the returned batches are sized based on the data that's scanned, and doesn't suffer the write-time skew that e.g. Parquet might introduce with its "ChunksOfStruct" (row-groups of columns) strategy.A
LayoutStrategy
is fed a series ofArrayData
batches that represent a row-wise chunk of an array. Each strategy may manipulate, buffer, split, this batch however it likes and pass it to some child strategies, eventually culminating in a FlatStrategy that simply serializes the data. At the end of this process, the layout strategy returns theLayoutData
(remember the analogy toArrayData
?).Layout Scan
Scanning a layout is sort of analogous to an array compute function. It's probably the only "compute function" we support on layouts for a while (maybe we support sum/min/max etc?).
From a
LayoutData
, we construct aLayoutScan
, and from this we construct aRangeScan
per horizontal "split" of the file that we wish to read. This two-level approach is so for a single scan operator layouts can share some state across ranges, e.g. DictLayout would want to share the canonicalized values array.The old issue:
Tracking a bunch of breaking changes to the format and refactoring of the writer
The text was updated successfully, but these errors were encountered: