feat: add data file format / version information to manifest #2673
* - Encoding Name
  - Encoding Type
  - What it does
  - Supported Versions
  - When it is applied
* - Basic struct
  - Field encoding
  - Encodes non-nullable struct data
  - >= 2.0
  - Default encoding for structs
* - List
  - Field encoding
  - Encodes lists (nullable or non-nullable)
  - >= 2.0
  - Default encoding for lists
* - Basic Primitive
  - Field encoding
  - Encodes primitive data types using separate validity array
  - >= 2.0
  - Default encoding for primitive data types
* - Value
  - Array encoding
  - Encodes a single vector of fixed-width values
  - >= 2.0
  - Fallback encoding for fixed-width types
* - Binary
  - Array encoding
  - Encodes a single vector of variable-width data
  - >= 2.0
  - Fallback encoding for variable-width types
* - Dictionary
  - Array encoding
  - Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values
  - >= 2.0
  - Used on string pages with fewer than 100 unique elements
* - Packed struct
  - Array encoding
  - Encodes a struct with fixed-width fields in a row-major format making random access more efficient
  - >= 2.0
  - Only used on struct types if the field metadata attribute ``"packed"`` is set to ``"true"``
* - Fsst
  - Array encoding
  - Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols
  - >= 2.1
  - Used on string pages that are not dictionary encoded
* - Bitpacking
  - Array encoding
  - Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values
  - >= 2.1
  - Used on integral types
Once things stabilize, I think we can replace this with a more formal "spec" where we lay out exactly how these encodings work, but I don't see any reason to rush at the moment.
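Until such a spec exists, the "When it is applied" column of the table can be read as a small decision procedure. The following is a minimal sketch of that reading, not the actual lance-encoding implementation: `FileVersion`, `PageKind`, `ArrayEncoding`, and `choose_encoding` are hypothetical names, and the only values taken from the table are the 100-unique-value dictionary threshold and the 2.1 version gate for FSST and bitpacking.

```rust
// Hypothetical types for illustration only; these are not the lance-encoding API.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FileVersion {
    V2_0,
    V2_1,
}

#[derive(Debug, PartialEq, Eq)]
enum ArrayEncoding {
    Value,
    Binary,
    Dictionary,
    Fsst,
    Bitpacking,
}

enum PageKind {
    /// A page of strings, with the number of distinct values it contains.
    String { unique_values: usize },
    /// A page of integers.
    Integer,
    /// Any other fixed-width type.
    OtherFixedWidth,
}

/// From the table: dictionary encoding is used on string pages
/// with fewer than 100 unique elements.
const DICTIONARY_THRESHOLD: usize = 100;

/// Picks an array encoding following the "When it is applied" column.
fn choose_encoding(page: &PageKind, version: FileVersion) -> ArrayEncoding {
    match page {
        // Few unique strings: dictionary encoding (available since 2.0).
        PageKind::String { unique_values } if *unique_values < DICTIONARY_THRESHOLD => {
            ArrayEncoding::Dictionary
        }
        // Strings that are not dictionary encoded: FSST, but only from 2.1 on.
        PageKind::String { .. } if version >= FileVersion::V2_1 => ArrayEncoding::Fsst,
        // Fallback for variable-width data.
        PageKind::String { .. } => ArrayEncoding::Binary,
        // Integral types: bitpacking, but only from 2.1 on.
        PageKind::Integer if version >= FileVersion::V2_1 => ArrayEncoding::Bitpacking,
        // Fallback for fixed-width data.
        PageKind::Integer | PageKind::OtherFixedWidth => ArrayEncoding::Value,
    }
}
```

Under these assumptions, a string page with 500 unique values falls back to `Binary` when writing 2.0 files and uses `Fsst` when writing 2.1 files.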
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2673 +/- ##
==========================================
- Coverage 79.61% 79.41% -0.20%
==========================================
Files 226 226
Lines 66516 66543 +27
Branches 66516 66543 +27
==========================================
- Hits 52955 52848 -107
- Misses 10465 10593 +128
- Partials 3096 3102 +6
This is well done. Thanks for updating the format docs. Glad to see that more in sync.
rust/lance-encoding/src/version.rs
Outdated
impl TryFrom<&str> for LanceFileVersion {
    type Error = Error;
I wonder if the more canonical trait to implement is `FromStr`, which enables the `parse` method: https://doc.rust-lang.org/std/str/trait.FromStr.html
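For reference, a minimal sketch of what the `FromStr` suggestion could look like. The `FileVersion` enum, its variants, the accepted strings, and `ParseVersionError` are all assumptions made for illustration; the real `LanceFileVersion` and its error type in `version.rs` may differ.

```rust
use std::fmt;
use std::str::FromStr;

// Hypothetical stand-in for the enum under review; the variants and the
// accepted strings are assumptions, not the actual LanceFileVersion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileVersion {
    V2_0,
    V2_1,
}

// Hypothetical error type that carries the rejected input.
#[derive(Debug, PartialEq, Eq)]
struct ParseVersionError(String);

impl fmt::Display for ParseVersionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown file version: {}", self.0)
    }
}

impl std::error::Error for ParseVersionError {}

impl FromStr for FileVersion {
    type Err = ParseVersionError;

    // Implementing from_str is what makes `"2.1".parse::<FileVersion>()` work.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "2.0" => Ok(Self::V2_0),
            "2.1" => Ok(Self::V2_1),
            other => Err(ParseVersionError(other.to_string())),
        }
    }
}
```

With this in place, callers can write `let v: FileVersion = "2.1".parse()?;` rather than calling `FileVersion::try_from("2.1")`.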
…ify what file format (only lance) and the version to use when writing data Introduce a configurable version to the lance writer and change FSST and bitpacking to be guarded by a 2_1 version instead of env. variables Change compression to be based on field metadata instead of environment variables Migrate some tests to use v2
… let's use a new file
319b051 to 7d1f114
Add a new "data storage format" property which allows a dataset to specify what file format (only lance) and the version to use when writing data
Introduce a configurable version to the lance writer and change FSST and bitpacking to be guarded by a 2_1 version instead of environment variables
Change compression to be based on field metadata instead of environment variables
Migrate some tests to use v2
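As a rough illustration of the first and third items, the sketch below shows one way a storage-format property and a metadata-driven encoding switch could be modeled. It is a sketch under stated assumptions, not the code in this PR: `DataStorageFormat`, its field names, and `is_packed` are hypothetical. The only detail taken from the format docs in this PR is the field metadata attribute `"packed"` being set to `"true"`.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the "data storage format" property: which file
/// format a dataset uses and which version to write. Field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DataStorageFormat {
    file_format: String,
    version: String,
}

impl DataStorageFormat {
    /// Only the "lance" file format is available, so only the version varies.
    fn lance(version: &str) -> Self {
        Self {
            file_format: "lance".to_string(),
            version: version.to_string(),
        }
    }
}

/// Reads an encoding choice from field metadata instead of an environment
/// variable: returns true only when "packed" is present and equal to "true".
fn is_packed(field_metadata: &HashMap<String, String>) -> bool {
    field_metadata.get("packed").map(String::as_str) == Some("true")
}
```

Reading the switch from field metadata keeps the choice attached to the schema, so it travels with the dataset rather than depending on the environment of the process that writes it.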