feat: allow users to specify late/early materialization, adjust default threshold #2916

westonpace · 2024-09-19T22:59:54Z

Previously we always used late materialization. In many cases this resulted in far too many IOPS, which hurt both the performance and the cost of queries.

The new approach uses early materialization for cloud storage (except for variable-sized fields).
For local storage, any field that has 10 or fewer bytes per value is early materialized.

westonpace · 2024-09-19T23:00:07Z

Coming soon: benchmarks to justify this decision

westonpace · 2024-09-20T13:02:05Z

Experiments:

The following charts show the ratio of early materialization time to late materialization time. Anything greater than 1 favors late materialization. Anything less than 1 favors early materialization. At 1 the two approaches take roughly the same amount of time.

NVME

Cloud Storage

Results

The cutoff of 10 for NVME seems reasonably justified. There is not much penalty for using early materialization with 4- or 8-byte values, even when the filter is highly selective. There are potentially very large penalties for using early materialization with 64- or 256-byte values (and the penalty even at 10% is much smaller / non-existent).

The cutoff of 1000 for cloud seems reasonable as well. 256-byte values still prefer early materialization even for rather aggressive filters.
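As a rough illustration of the default rule discussed in this thread, here is a minimal Python sketch. It is not the code in `scanner.rs`: the function and constant names are hypothetical, the inclusive comparison for cloud storage is an assumption (the thread only says "10 or fewer" for local storage), and treating variable-sized fields as late materialized on local storage is also an assumption.

```python
from typing import Optional

# Cutoffs quoted in this thread, in bytes per value.
# The constant names are hypothetical, chosen for this sketch only.
LOCAL_EARLY_THRESHOLD_BYTES = 10
CLOUD_EARLY_THRESHOLD_BYTES = 1000


def default_is_early(bytes_per_value: Optional[int], is_cloud: bool) -> bool:
    """Return True if a field would be early materialized by default.

    bytes_per_value is None for variable-sized fields. This sketch always
    late materializes those (stated for cloud storage in the thread,
    assumed here for local storage).
    """
    if bytes_per_value is None:
        return False
    if is_cloud:
        threshold = CLOUD_EARLY_THRESHOLD_BYTES
    else:
        threshold = LOCAL_EARLY_THRESHOLD_BYTES
    # Small values are cheap to read up front; wide values are deferred
    # until after the filter has been applied.
    return bytes_per_value <= threshold
```

Under these assumptions, an 8-byte field is read early on NVME, a 64-byte field is deferred on NVME, and a 256-byte field is still read early on cloud storage, which matches the results described above.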

codecov-commenter · 2024-09-20T13:52:53Z

Codecov Report

Attention: Patch coverage is 61.39535% with 83 lines in your changes missing coverage. Please review.

Project coverage is 77.76%. Comparing base (f763d42) to head (26b5970).

Files with missing lines	Patch %	Lines
rust/lance-arrow/src/schema.rs	0.00%	63 Missing ⚠️
rust/lance/src/dataset/scanner.rs	87.61%	1 Missing and 12 partials ⚠️
rust/lance-arrow/src/lib.rs	86.66%	3 Missing and 1 partial ⚠️
rust/lance-core/src/datatypes/schema.rs	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2916      +/-   ##
==========================================
- Coverage   77.82%   77.76%   -0.06%     
==========================================
  Files         231      231              
  Lines       70280    70442     +162     
  Branches    70280    70442     +162     
==========================================
+ Hits        54695    54781      +86     
- Misses      12695    12762      +67     
- Partials     2890     2899       +9

Flag	Coverage Δ
unittests	`77.76% <61.39%> (-0.06%)`	⬇️


wjones127

Nice work!

wjones127 · 2024-09-24T00:33:31Z

python/python/lance/dataset.py

@@ -309,6 +310,23 @@ def scanner(
            number of rows (or be empty) if the rows closest to the query do not
            match the filter.  It's generally good when the filter is not very
            selective.
+        late_materialization: bool or List[str], default None


I like the flexibility of this parameter. Very useful!
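To make the `bool or List[str]` shape of the parameter concrete, here is a small self-contained sketch of how such a value could be resolved into a per-column decision. It does not call Lance. The helper name `resolve_late_materialization` and the `default_late` mapping are hypothetical, and the semantics are an assumption based on the parameter's type: `None` falls back to the defaults, `True` or `False` applies to every column, and a list names the columns to late materialize.

```python
from typing import Dict, List, Optional, Union


def resolve_late_materialization(
    columns: List[str],
    late_materialization: Optional[Union[bool, List[str]]],
    default_late: Dict[str, bool],
) -> Dict[str, bool]:
    """Map each column to True (late materialized) or False (early)."""
    if late_materialization is None:
        # No user preference: fall back to the per-column defaults.
        return {name: default_late[name] for name in columns}
    if isinstance(late_materialization, bool):
        # A single flag applies to every column.
        return {name: late_materialization for name in columns}
    late = set(late_materialization)
    unknown = late - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Only the listed columns are late materialized.
    return {name: name in late for name in columns}
```

For example, passing `["vector"]` for columns `["id", "vector"]` would defer only `vector` and read `id` up front.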

wjones127 · 2024-09-24T00:35:50Z

rust/lance-arrow/src/schema.rs

Yeah, I used this for debugging, I don't think it's part of the main code path anymore. However, it was useful as I was struggling to read the verbose debug reprs. I think it needs some more improvements and it's also a lossy conversion (e.g. no distinction between string / large string / string view / etc.) so coming up with some nice compact way to represent these differences would be cool.

github-actions bot added the enhancement and python labels on Sep 19, 2024

wjones127 approved these changes Sep 24, 2024

westonpace added 3 commits September 24, 2024 08:04

Add the ability to customize whether early or late materialization is used. Changes the default to prefer early materialization of most fields.

2acd252

Keep the dimension for TestVectorDataset as 32 for most tests as they have that baked into asserts. Just change it to 256 for the plan test.

4d9d71e

Fix missing default. Change materialization_style to late_materialization in python

26b5970

westonpace force-pushed the feat/custom-lazy-materialization branch from 666d466 to 26b5970 Compare September 24, 2024 15:04

westonpace merged commit 3c17c67 into lancedb:main Sep 24, 2024
22 checks passed
