-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: packed struct encoding #3186
feat: packed struct encoding #3186
Conversation
the current main(lance 2.0 packed-struct encoding) doesn't work with the benchmark script in |
ff7ddfd
to
72d6920
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3186 +/- ##
==========================================
+ Coverage 78.45% 78.51% +0.06%
==========================================
Files 244 245 +1
Lines 84554 84818 +264
Branches 84554 84818 +264
==========================================
+ Hits 66333 66595 +262
+ Misses 15418 15406 -12
- Partials 2803 2817 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
packed-struct with FixedSizeList is not supported for now. |
of struct is fixed-width data block when using packed-struct encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions but this approach seems like a good starting point overall.
@@ -218,23 +218,34 @@ impl ReaderProjection { | |||
/// | |||
/// If the schema provided is not the schema of the entire file then | |||
/// the projection will be invalid and the read will fail. | |||
/// If the field is a `struct datatype` with `packed` set to true in the field metadata, | |||
/// the whole struct has one column index. | |||
/// To support nested `packed-struct encoding`, this method need to be further adjusted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// To support nested `packed-struct encoding`, this method need to be further adjusted. |
This sentence seems redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sentence is to document nested packed struct
encoding is not supported at the moment.
{ | ||
column_indices.push(curr_column_idx); | ||
curr_column_idx += 1; | ||
packed_struct_fields_num = field.children.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be correct on nested structs. For example:
A-+
| |
| |
B D-+
| | |
| | |
C E F
If A is packed then field.children.len()
will be 2 but there are 5 elements that need to be skipped. I think it would be easiest to just do our own DFS here.
pub fn from_whole_schema(schema: &Schema, version: LanceFileVersion) -> Self {
let schema = Arc::new(schema.clone());
let is_structural = version >= LanceFileVersion::V2_1;
let mut counter = 0;
let mut column_indices = Vec::new();
for field in schema.fields {
from_whole_field(field, &mut counter, &mut column_indices);
}
Self {
schema,
column_indices,
}
}
fn from_whole_field(field: &Field, counter: &mut u32, column_indices: &mut Vec<u32>) {
if is_packed_struct(field) {
add column to column indices
return early
}
if is_primitive or !is_structural {
add column to column indices
}
for child in field.children {
recurse
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nested packed struct is not supported for now, in fact, the packed struct encoding
will panic if encounter nested packed struct
so currently in 2.1 nested packed struct data can not be written.
inside PrimitiveStructuralEncoder::do_flush:
// if the `data_block` is a `StructDataBlock`, then this is a struct with packed struct encoding.
if let DataBlock::Struct(ref struct_data_block) = data_block {
if struct_data_block
.children
.iter()
.any(|child| !matches!(child, DataBlock::FixedWidth(_)))
{
panic!("packed struct encoding currently only supports fixed-width fields.")
}
}
rust/lance-file/src/v2/reader.rs
Outdated
} | ||
if field_metadata | ||
.get("packed") | ||
.map(|v| v == "true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this a case insensitive comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
fn get_stat(&self, stat: Stat) -> Option<Arc<dyn Array>> { | ||
match stat { | ||
Stat::MaxLength => { | ||
let max_len = self.dimension * self.child.expect_single_stat::<UInt64Type>(stat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does every block have MaxLength
? Is it possible the FSL's child doesn't have MaxLength
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, this is vestige, statistics for fixedSizeList
data block is not supported in this PR.
Some(Arc::new(UInt64Array::from(vec![max_len]))) | ||
} | ||
Stat::DataSize => self.child.get_stat(stat), | ||
_ => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about NullCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this code completely as fixed size list statistics is not supported in this PR.
rust/lance-encoding/src/decoder.rs
Outdated
let field_metadata = &field.metadata; | ||
if field_metadata | ||
.get("packed") | ||
.map(|v| v == "true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this case insensitive.
Also, this check repeats in a few places. We could maybe add is_packed_struct
to FieldExt
in lance-arrow/src/schema.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed.
I added two is_packed_struct
, one in lance-arrow/src/schema.rs
, one in lance-core/src/datatypes.field.rs
, not sure how to do only one.
let mut children = vec![]; | ||
|
||
let bytes_per_row: u32 = bits_per_values.iter().sum::<u32>() / 8; | ||
let bytes_per_row = bytes_per_row as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a debug assert to verify our assumption that bits_per_values.iter().all(|bpv| bpv % 8 == 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks.
if field_metadata | ||
.get("packed") | ||
.map(|v| v == "true") | ||
.unwrap_or(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also case insensitive (or replace with call to helper method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
This PR tries to add packed struct encoding.
During encoding, it packs a struct with fixed width fields, producing a row oriented
FixedWidthDataBlock
, then useValueCompressor
to compressor to aMiniBlock Layout
.during decoding, it first uses
ValueDecompressor
to get the row-orientedFixedWidthDataBlock
, then construct aStructDataBlock
for output.#3173 #2601