-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow users to specify late/early materialization, adjust default threshold #2916
feat: allow users to specify late/early materialization, adjust default threshold #2916
Conversation
Coming soon: benchmarks to justify this decision |
Experiments: The following charts show the ratio between early and late materialization. Anything greater than 1 favors late materialization. Anything less than 1 favors early materialization. At 1 the two approaches give roughly the same amount of time. NVMECloud StorageResultsThe cutoff of 10 for NVME seems reasonably justified. There is not much penalty for using early materialization with 4/8 bytes even when the filter is highly selective. There are potentially very large penalties for using early materialization with 64/256 byte values (and the penalty even at 10% is much smaller / non-existent). The cutoff of 1000 for cloud seems reasonable as well. 256 byte values still prefer early materialization even for rather aggressive filters. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2916 +/- ##
==========================================
- Coverage 77.82% 77.76% -0.06%
==========================================
Files 231 231
Lines 70280 70442 +162
Branches 70280 70442 +162
==========================================
+ Hits 54695 54781 +86
- Misses 12695 12762 +67
- Partials 2890 2899 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
@@ -309,6 +310,23 @@ def scanner( | |||
number of rows (or be empty) if the rows closest to the query do not | |||
match the filter. It's generally good when the filter is not very | |||
selective. | |||
late_materialization: bool or List[str], default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the flexibility of this parameter. Very useful!
rust/lance-arrow/src/schema.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fancy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I used this for debugging, I don't think it's part of the main code path anymore. However, it was useful as I was struggling to read the verbose debug reprs. I think it needs some more improvements and it's also a lossy conversion (e.g. no distinction between string / large string / string view / etc.) so coming up with some nice compact way to represent these differences would be cool.
… used. Changes the default to prefer early materialization of most fields.
… have that baked into asserts. Just change it to 256 for the plan test.
666d466
to
26b5970
Compare
Previously we always used late materialization. In many cases this resulted in far too many IOPS which impacted the performance (and the cost) of queries.
The new approach uses early materialization for cloud storage (except for variable sized fields)
For local storage any field that has 10 or fewer bytes per value is early materialized