lancedb version 0.9.0 with Lance v2: the performance of the new file format is slower than the old format #2629

Closed
wpjgit opened this issue Jul 22, 2024 · 1 comment · Fixed by #2636

wpjgit commented Jul 22, 2024

lancedb version 0.9.0, using Lance v2: the performance of the new file format is slower than the old format.

dataset:
parquet: https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
parquet to lance:
lance.write_dataset(parquet, lance_path, use_legacy_format=False)
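For reference, a minimal end-to-end sketch of that conversion, assuming the parquet file has been downloaded locally and that pyarrow is used to read it (the paths here are illustrative):

```python
import lance
import pyarrow.parquet as pq

# Read the downloaded lineitem parquet file (local path is illustrative).
table = pq.read_table("lineitemsf1.snappy.parquet")

# Write it as a Lance dataset using the new (v2) file format.
# Passing use_legacy_format=True writes the old (v1) format instead.
lance.write_dataset(table, "lineitem.lance", use_legacy_format=False)
```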

case:
table = db.open_table("lineitem")
res = table.search().where("l_shipmode = 'FOB'").limit(1000000000).to_pandas()
Latency of old format: (screenshot)
Latency of new format: (screenshot)
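For anyone reproducing this, a self-contained sketch of the case above with timing added; the database directory, and the assumption that the Lance dataset was written as `lineitem.lance` inside it, are illustrative rather than taken from the report:

```python
import time
import lancedb

# Connect to the database directory that contains lineitem.lance
# (written in the conversion step above; path is illustrative).
db = lancedb.connect("./lancedb")
table = db.open_table("lineitem")

# Filter-only query from the report: no vector search, just a scan
# with a predicate and a very large limit.
start = time.perf_counter()
res = table.search().where("l_shipmode = 'FOB'").limit(1000000000).to_pandas()
print(f"rows: {len(res)}, latency: {time.perf_counter() - start:.2f}s")
```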

westonpace (Contributor) commented

Thanks for the report; this is definitely something to fix. The query is triggering late materialization, and the v2 path is not coalescing I/O effectively here, which translates into far too many tiny reads.
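Roughly, late materialization means the scan first reads only the filter column and then goes back to fetch the remaining columns at just the matching row positions; those scattered fetches are the tiny reads mentioned above. A conceptual sketch using the public pylance API (not the internal v2 read path; the dataset path and projected columns are illustrative):

```python
import lance
import pyarrow.compute as pc

ds = lance.dataset("./lancedb/lineitem.lance")  # path is illustrative

# 1. Scan only the column used in the predicate.
ship = ds.to_table(columns=["l_shipmode"])
matching = pc.indices_nonzero(pc.equal(ship["l_shipmode"], "FOB"))

# 2. Fetch the remaining columns only at the matching row positions.
#    These scattered takes are the reads that need to be coalesced.
rows = ds.take(matching.to_pylist(), columns=["l_orderkey", "l_quantity"])
```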

raunaks13 added a commit that referenced this issue Jul 29, 2024
Should fix #2629; addresses #1959.
1. Previously, when randomly accessing rows, each row request was
scheduled separately, which increased overhead, especially on large
datasets. This PR coalesces take scheduling when requests are within
`block_size` distance of each other; the block size is determined
based on the system. (A sketch of this coalescing idea follows the
benchmark numbers below.)
2. The binary scheduler was also scheduling the decoding of all indices
individually. This PR updates the binary scheduler so that it schedules
all offsets at once; these are then processed to determine which bytes
to decode, as before.
3. A script we can use to compare v1 vs v2 performance is added as
`test_random_access.py`.

Specifically, on the lineitem dataset (same file from the issue above):
- v1 query time: `0.12s`
- v2 query time (before): `2.8s`
- v2 query time (after (1)): `0.54s`
- v2 query time (after (1) and (2)): `0.02s`
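As a rough standalone illustration of the coalescing described in point 1 above (not Lance's actual scheduler code): neighbouring requests are merged into a single read whenever the gap between them is within a block size.

```python
# Hypothetical sketch: group sorted offsets (byte offsets or row
# positions) into coalesced read ranges whenever neighbouring requests
# are within `block_size` of each other. Lance's real scheduler works
# on I/O requests inside the v2 file reader; this only shows the idea.

def coalesce(offsets, block_size=4096):
    ranges = []
    for off in sorted(offsets):
        if ranges and off - ranges[-1][1] <= block_size:
            # Close enough to the previous range: extend it.
            ranges[-1][1] = off
        else:
            # Too far away: start a new read range.
            ranges.append([off, off])
    return [tuple(r) for r in ranges]

# Example: five scattered requests collapse into two reads.
print(coalesce([0, 100, 200, 100_000, 100_050], block_size=4096))
# -> [(0, 200), (100000, 100050)]
```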