Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow users to specify late/early materialization, adjust default threshold #2916

Merged
merged 3 commits into from
Sep 24, 2024

Conversation

westonpace
Copy link
Contributor

Previously we always used late materialization. In many cases this resulted in far too many IOPS which impacted the performance (and the cost) of queries.

The new approach uses early materialization for cloud storage (except for variable sized fields)
For local storage any field that has 10 or fewer bytes per value is early materialized

@westonpace
Copy link
Contributor Author

Coming soon: benchmarks to justify this decision

@github-actions github-actions bot added enhancement New feature or request python labels Sep 19, 2024
@westonpace
Copy link
Contributor Author

Experiments:

The following charts show the ratio between early and late materialization. Anything greater than 1 favors late materialization. Anything less than 1 favors early materialization. At 1 the two approaches give roughly the same amount of time.

NVME

image

Cloud Storage

image

Results

The cutoff of 10 for NVME seems reasonably justified. There is not much penalty for using early materialization with 4/8 bytes even when the filter is highly selective. There are potentially very large penalties for using early materialization with 64/256 byte values (and the penalty even at 10% is much smaller / non-existent).

The cutoff of 1000 for cloud seems reasonable as well. 256 byte values still prefer early materialization even for rather aggressive filters.

@codecov-commenter
Copy link

codecov-commenter commented Sep 20, 2024

Codecov Report

Attention: Patch coverage is 61.39535% with 83 lines in your changes missing coverage. Please review.

Project coverage is 77.76%. Comparing base (f763d42) to head (26b5970).

Files with missing lines Patch % Lines
rust/lance-arrow/src/schema.rs 0.00% 63 Missing ⚠️
rust/lance/src/dataset/scanner.rs 87.61% 1 Missing and 12 partials ⚠️
rust/lance-arrow/src/lib.rs 86.66% 3 Missing and 1 partial ⚠️
rust/lance-core/src/datatypes/schema.rs 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2916      +/-   ##
==========================================
- Coverage   77.82%   77.76%   -0.06%     
==========================================
  Files         231      231              
  Lines       70280    70442     +162     
  Branches    70280    70442     +162     
==========================================
+ Hits        54695    54781      +86     
- Misses      12695    12762      +67     
- Partials     2890     2899       +9     
Flag Coverage Δ
unittests 77.76% <61.39%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@@ -309,6 +310,23 @@ def scanner(
number of rows (or be empty) if the rows closest to the query do not
match the filter. It's generally good when the filter is not very
selective.
late_materialization: bool or List[str], default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flexibility of this parameter. Very useful!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I used this for debugging, I don't think it's part of the main code path anymore. However, it was useful as I was struggling to read the verbose debug reprs. I think it needs some more improvements and it's also a lossy conversion (e.g. no distinction between string / large string / string view / etc.) so coming up with some nice compact way to represent these differences would be cool.

… used.

Changes the default to prefer early materialization of most fields.
… have that baked into asserts. Just change it to 256 for the plan test.
@westonpace westonpace force-pushed the feat/custom-lazy-materialization branch from 666d466 to 26b5970 Compare September 24, 2024 15:04
@westonpace westonpace merged commit 3c17c67 into lancedb:main Sep 24, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants