Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: packed struct encoding #3186

Merged
merged 21 commits into from
Dec 10, 2024

Conversation

broccoliSpicy
Copy link
Contributor

@broccoliSpicy broccoliSpicy commented Nov 29, 2024

This PR tries to add packed struct encoding.

During encoding, it packs a struct with fixed width fields, producing a row oriented FixedWidthDataBlock, then use ValueCompressor to compressor to a MiniBlock Layout.

during decoding, it first uses ValueDecompressor to get the row-oriented FixedWidthDataBlock, then construct a StructDataBlock for output.

#3173 #2601

@github-actions github-actions bot added enhancement New feature or request python labels Nov 29, 2024
@broccoliSpicy
Copy link
Contributor Author

broccoliSpicy commented Nov 29, 2024

the current main(lance 2.0 packed-struct encoding) doesn't work with the benchmark script in python/benchmarks/test_packed_struct.py -k "read"

@broccoliSpicy broccoliSpicy force-pushed the packed-struct-encoding branch from ff7ddfd to 72d6920 Compare December 4, 2024 16:57
@broccoliSpicy broccoliSpicy changed the title feat: packed struct encoding(draft, wip) feat: packed struct encoding Dec 4, 2024
@broccoliSpicy broccoliSpicy changed the title feat: packed struct encoding feat: packed struct encoding(wip) Dec 4, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 88.92734% with 32 lines in your changes missing coverage. Please review.

Project coverage is 78.51%. Comparing base (ef9d0c2) to head (b15eec9).

Files with missing lines Patch % Lines
...encoding/src/encodings/physical/struct_encoding.rs 88.67% 10 Missing and 2 partials ⚠️
rust/lance-file/src/v2/reader.rs 57.14% 6 Missing and 3 partials ⚠️
rust/lance-encoding/src/data.rs 96.36% 2 Missing ⚠️
rust/lance-encoding/src/decoder.rs 89.47% 0 Missing and 2 partials ⚠️
rust/lance-encoding/src/encoder.rs 77.77% 1 Missing and 1 partial ⚠️
.../lance-encoding/src/encodings/logical/primitive.rs 75.00% 1 Missing and 1 partial ⚠️
rust/lance-encoding/src/statistics.rs 95.34% 2 Missing ⚠️
rust/lance-encoding/src/encodings/physical.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3186      +/-   ##
==========================================
+ Coverage   78.45%   78.51%   +0.06%     
==========================================
  Files         244      245       +1     
  Lines       84554    84818     +264     
  Branches    84554    84818     +264     
==========================================
+ Hits        66333    66595     +262     
+ Misses      15418    15406      -12     
- Partials     2803     2817      +14     
Flag Coverage Δ
unittests 78.51% <88.92%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@broccoliSpicy broccoliSpicy changed the title feat: packed struct encoding(wip) feat: packed struct encoding Dec 5, 2024
@broccoliSpicy
Copy link
Contributor Author

packed-struct with FixedSizeList is not supported for now.

@broccoliSpicy
Copy link
Contributor Author

broccoliSpicy commented Dec 9, 2024

lance 2.1 benchmark result(10 million rows):
write:
Screenshot 2024-12-09 at 11 59 53 AM

random access read(100 randomly generated indices for random access):
Screenshot 2024-12-09 at 11 59 36 AM

random access read, 1000 randomly generated indices for random access:
Screenshot 2024-12-09 at 12 59 50 PM

full scan read(batch_size: 1000):
when parquet read batch size not set:
Screenshot 2024-12-09 at 11 58 43 AM
when parquet read batch size is also set to 1000:

Screenshot 2024-12-09 at 12 53 06 PM

full scan read(lance batch_size: 16 * 1024):
when parquet read batch size not set:
Screenshot 2024-12-09 at 12 19 06 PM
when parquet read batch size is also set to 16 * 1024
Screenshot 2024-12-09 at 12 46 00 PM

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few suggestions but this approach seems like a good starting point overall.

@@ -218,23 +218,34 @@ impl ReaderProjection {
///
/// If the schema provided is not the schema of the entire file then
/// the projection will be invalid and the read will fail.
/// If the field is a `struct datatype` with `packed` set to true in the field metadata,
/// the whole struct has one column index.
/// To support nested `packed-struct encoding`, this method need to be further adjusted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// To support nested `packed-struct encoding`, this method need to be further adjusted.

This sentence seems redundant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence is to document nested packed struct encoding is not supported at the moment.

{
column_indices.push(curr_column_idx);
curr_column_idx += 1;
packed_struct_fields_num = field.children.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be correct on nested structs. For example:

A-+  
| |  
| |  
B D-+
| | |
| | |
C E F

If A is packed then field.children.len() will be 2 but there are 5 elements that need to be skipped. I think it would be easiest to just do our own DFS here.

    pub fn from_whole_schema(schema: &Schema, version: LanceFileVersion) -> Self {
        let schema = Arc::new(schema.clone());
        let is_structural = version >= LanceFileVersion::V2_1;
        let mut counter = 0;
        let mut column_indices = Vec::new();
        for field in schema.fields {
            from_whole_field(field, &mut counter, &mut column_indices);
        }
        Self {
            schema,
            column_indices,
        }
    }
    fn from_whole_field(field: &Field, counter: &mut u32, column_indices: &mut Vec<u32>) {
        if is_packed_struct(field) {
             add column to column indices
             return early
        }
        if is_primitive or !is_structural {
             add column to column indices
        }
        for child in field.children {
             recurse
        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested packed struct is not supported for now, in fact, the packed struct encoding will panic if encounter nested packed struct so currently in 2.1 nested packed struct data can not be written.

inside PrimitiveStructuralEncoder::do_flush:

                // if the `data_block` is a `StructDataBlock`, then this is a struct with packed struct encoding.
                if let DataBlock::Struct(ref struct_data_block) = data_block {
                    if struct_data_block
                        .children
                        .iter()
                        .any(|child| !matches!(child, DataBlock::FixedWidth(_)))
                    {
                        panic!("packed struct encoding currently only supports fixed-width fields.")
                    }
                }

}
if field_metadata
.get("packed")
.map(|v| v == "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a case insensitive comparison

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

fn get_stat(&self, stat: Stat) -> Option<Arc<dyn Array>> {
match stat {
Stat::MaxLength => {
let max_len = self.dimension * self.child.expect_single_stat::<UInt64Type>(stat);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does every block have MaxLength? Is it possible the FSL's child doesn't have MaxLength?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, this is vestige, statistics for fixedSizeList data block is not supported in this PR.

Some(Arc::new(UInt64Array::from(vec![max_len])))
}
Stat::DataSize => self.child.get_stat(stat),
_ => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about NullCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this code completely as fixed size list statistics is not supported in this PR.

let field_metadata = &field.metadata;
if field_metadata
.get("packed")
.map(|v| v == "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this case insensitive.

Also, this check repeats in a few places. We could maybe add is_packed_struct to FieldExt in lance-arrow/src/schema.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed.
I added two is_packed_struct, one in lance-arrow/src/schema.rs, one in lance-core/src/datatypes.field.rs, not sure how to do only one.

let mut children = vec![];

let bytes_per_row: u32 = bits_per_values.iter().sum::<u32>() / 8;
let bytes_per_row = bytes_per_row as u64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a debug assert to verify our assumption that bits_per_values.iter().all(|bpv| bpv % 8 == 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

Comment on lines 612 to 615
if field_metadata
.get("packed")
.map(|v| v == "true")
.unwrap_or(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also case insensitive (or replace with call to helper method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@broccoliSpicy broccoliSpicy merged commit c4cb87a into lancedb:main Dec 10, 2024
20 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants