feat: add data file format / version information to manifest #2673

Merged (6 commits) on Aug 5, 2024

westonpace
Contributor

Add a new "data storage format" property that lets a dataset specify which file format (currently only lance) and which version to use when writing data.

Introduce a configurable version to the lance writer, and guard FSST and bitpacking behind a 2_1 version instead of environment variables.
Change compression to be based on field metadata instead of environment variables.
Migrate some tests to use v2.
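
A minimal sketch of the idea described above, not the code from this PR: a manifest-level format record carries a version, and newer encodings are enabled by comparing against that version rather than by reading environment variables. The names `FileVersion`, `DataStorageFormat`, `allows_v2_1_encodings`, and `compression_for`, as well as the metadata key `"compression"`, are hypothetical and chosen only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the file version. The derived ordering follows
/// declaration order, so `V2_0 < V2_1`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FileVersion {
    V2_0,
    V2_1,
}

/// Hypothetical "data storage format" record kept alongside the dataset.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataStorageFormat {
    /// Only "lance" is described as supported in the PR description.
    pub file_format: String,
    pub version: FileVersion,
}

/// Encodings such as FSST and bitpacking are gated on the configured version
/// instead of an environment variable.
pub fn allows_v2_1_encodings(format: &DataStorageFormat) -> bool {
    format.version >= FileVersion::V2_1
}

/// Compression is read from per-field metadata. The key name "compression"
/// is an assumption made for this sketch, not the key used by the library.
pub fn compression_for(metadata: &HashMap<String, String>) -> Option<&str> {
    metadata.get("compression").map(String::as_str)
}
```

With this shape, a writer configured for `V2_0` never emits the gated encodings, and a field with no compression entry in its metadata simply yields `None`.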

github-actions bot added the enhancement and python labels on Aug 1, 2024
Comment on lines +151 to +200
* - Encoding Name
  - Encoding Type
  - What it does
  - Supported Versions
  - When it is applied
* - Basic struct
  - Field encoding
  - Encodes non-nullable struct data
  - >= 2.0
  - Default encoding for structs
* - List
  - Field encoding
  - Encodes lists (nullable or non-nullable)
  - >= 2.0
  - Default encoding for lists
* - Basic Primitive
  - Field encoding
  - Encodes primitive data types using separate validity array
  - >= 2.0
  - Default encoding for primitive data types
* - Value
  - Array encoding
  - Encodes a single vector of fixed-width values
  - >= 2.0
  - Fallback encoding for fixed-width types
* - Binary
  - Array encoding
  - Encodes a single vector of variable-width data
  - >= 2.0
  - Fallback encoding for variable-width types
* - Dictionary
  - Array encoding
  - Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values
  - >= 2.0
  - Used on string pages with fewer than 100 unique elements
* - Packed struct
  - Array encoding
  - Encodes a struct with fixed-width fields in a row-major format making random access more efficient
  - >= 2.0
  - Only used on struct types if the field metadata attribute ``"packed"`` is set to ``"true"``
* - Fsst
  - Array encoding
  - Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols
  - >= 2.1
  - Used on string pages that are not dictionary encoded
* - Bitpacking
  - Array encoding
  - Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values
  - >= 2.1
  - Used on integral types
Contributor Author

Once things stabilize, I think we can replace this with a more formal "spec" where we lay out exactly how these encodings work, but I don't see any reason to rush at the moment.
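
Until such a spec exists, the "When it is applied" column of the table can be read as a small decision procedure. The sketch below is only an illustration of those rules as documented, assuming a hypothetical `FileVersion` enum and hypothetical helper names. It is not the encoder's selection logic, which may consider more than the table states.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the file version; `V2_0 < V2_1` by declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FileVersion {
    V2_0,
    V2_1,
}

/// Array encodings named in the table (packed struct is handled separately below).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrayEncoding {
    Value,
    Binary,
    Dictionary,
    Fsst,
    Bitpacking,
}

/// String pages: dictionary when there are fewer than 100 unique elements,
/// otherwise FSST on 2.1 and later, otherwise the binary fallback.
pub fn choose_string_encoding(page: &[&str], version: FileVersion) -> ArrayEncoding {
    let unique: HashSet<&str> = page.iter().copied().collect();
    if unique.len() < 100 {
        ArrayEncoding::Dictionary
    } else if version >= FileVersion::V2_1 {
        ArrayEncoding::Fsst
    } else {
        ArrayEncoding::Binary
    }
}

/// Integral types: bitpacking on 2.1 and later, otherwise the value fallback.
pub fn choose_integer_encoding(version: FileVersion) -> ArrayEncoding {
    if version >= FileVersion::V2_1 {
        ArrayEncoding::Bitpacking
    } else {
        ArrayEncoding::Value
    }
}

/// Packed struct is used only when the field metadata attribute "packed" is "true".
pub fn is_packed_struct(metadata: &HashMap<String, String>) -> bool {
    metadata.get("packed").map(String::as_str) == Some("true")
}

/// Bits required to store every value up to `max_value`; this is the saving
/// bitpacking exploits when integers do not span their full range.
pub fn bits_needed(max_value: u64) -> u32 {
    (u64::BITS - max_value.leading_zeros()).max(1)
}
```

For example, a column of 64-bit integers whose largest value is 255 needs only 8 bits per value under this width calculation.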

codecov-commenter commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 86.07242% with 50 lines in your changes missing coverage. Please review.

Project coverage is 79.41%. Comparing base (712405e) to head (74c1a24).
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-encoding/src/version.rs | 64.70% | 12 Missing ⚠️ |
| rust/lance-file/src/v2/reader.rs | 9.09% | 10 Missing ⚠️ |
| rust/lance-encoding/src/encoder.rs | 90.72% | 1 Missing and 8 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 77.27% | 3 Missing and 2 partials ⚠️ |
| rust/lance-file/src/v2/writer.rs | 82.60% | 2 Missing and 2 partials ⚠️ |
| rust/lance-table/src/format/manifest.rs | 90.00% | 4 Missing ⚠️ |
| rust/lance/src/dataset/transaction.rs | 83.33% | 2 Missing ⚠️ |
| .../lance-encoding/src/encodings/logical/primitive.rs | 80.00% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/updater.rs | 83.33% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/write.rs | 94.73% | 0 Missing and 1 partial ⚠️ |

... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2673      +/-   ##
==========================================
- Coverage   79.61%   79.41%   -0.20%     
==========================================
  Files         226      226              
  Lines       66516    66543      +27     
  Branches    66516    66543      +27     
==========================================
- Hits        52955    52848     -107     
- Misses      10465    10593     +128     
- Partials     3096     3102       +6     
| Flag | Coverage Δ |
|---|---|
| unittests | 79.41% <86.07%> (-0.20%) ⬇️ |


Contributor

wjones127 left a comment

This is well done. Thanks for updating the format docs; glad to see them more in sync.

Comment on lines 62 to 63
impl TryFrom<&str> for LanceFileVersion {
    type Error = Error;
Contributor

I wonder if the more canonical trait to implement is FromStr, which enables the parse method: https://doc.rust-lang.org/std/str/trait.FromStr.html
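
To illustrate the suggestion, here is a minimal sketch of what a `FromStr` implementation could look like. It uses a hypothetical `FileVersion` enum and a hypothetical `ParseVersionError` type rather than the real `LanceFileVersion` and `Error`, and the accepted strings ("legacy", "2.0", "2.1") are assumptions made for the example. The existing `TryFrom<&str>` entry point can be kept by delegating to `parse`.

```rust
use std::convert::TryFrom;
use std::fmt;
use std::str::FromStr;

/// Hypothetical stand-in for the version enum discussed in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileVersion {
    Legacy,
    V2_0,
    V2_1,
}

/// Hypothetical error type; holds the string that failed to parse.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParseVersionError(pub String);

impl fmt::Display for ParseVersionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unknown file version: {}", self.0)
    }
}

impl std::error::Error for ParseVersionError {}

impl FromStr for FileVersion {
    type Err = ParseVersionError;

    /// The accepted spellings are assumptions for this sketch.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "legacy" => Ok(FileVersion::Legacy),
            "2.0" => Ok(FileVersion::V2_0),
            "2.1" => Ok(FileVersion::V2_1),
            other => Err(ParseVersionError(other.to_string())),
        }
    }
}

/// Keeps existing `TryFrom<&str>` callers working by delegating to `FromStr`.
impl TryFrom<&str> for FileVersion {
    type Error = ParseVersionError;

    fn try_from(s: &str) -> Result<Self, Self::Error> {
        s.parse()
    }
}
```

Callers can then write `"2.1".parse::<FileVersion>()`, which is the idiom `FromStr` enables, while `FileVersion::try_from("2.1")` continues to compile.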

westonpace merged commit 70a75f3 into lancedb:main on Aug 5, 2024
21 of 23 checks passed