lancedb version 0.9.0 with Lance v2: the performance of the new file format is slower than the old format #2629

Closed
wpjgit opened this issue Jul 22, 2024 · 1 comment · Fixed by #2636

wpjgit commented Jul 22, 2024

lancedb version 0.9.0, using Lance v2: the performance of the new file format is slower than the old format.

dataset:
parquet: https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
parquet to lance:
lance.write_dataset(parquet, lance_path, use_legacy_format=False)
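For reference, a minimal end-to-end sketch of that conversion, assuming the parquet file has been downloaded locally and that pyarrow is used to read it (the paths here are illustrative):

```python
import lance
import pyarrow.parquet as pq

# Read the downloaded lineitem parquet file (local path is illustrative).
table = pq.read_table("lineitemsf1.snappy.parquet")

# Write it as a Lance dataset using the new (v2) file format.
# Passing use_legacy_format=True writes the old (v1) format instead.
lance.write_dataset(table, "lineitem.lance", use_legacy_format=False)
```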

case:
table = db.open_table("lineitem")
res = table.search().where("l_shipmode = 'FOB'").limit(1000000000).to_pandas()
Latency of old format: (screenshot)
Latency of new format: (screenshot)
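For anyone reproducing this, a self-contained sketch of the case above with timing added; the database directory, and the assumption that the Lance dataset was written as `lineitem.lance` inside it, are illustrative rather than taken from the report:

```python
import time
import lancedb

# Connect to the database directory that contains lineitem.lance
# (written in the conversion step above; path is illustrative).
db = lancedb.connect("./lancedb")
table = db.open_table("lineitem")

# Filter-only query from the report: no vector search, just a scan
# with a predicate and a very large limit.
start = time.perf_counter()
res = table.search().where("l_shipmode = 'FOB'").limit(1000000000).to_pandas()
print(f"rows: {len(res)}, latency: {time.perf_counter() - start:.2f}s")
```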

westonpace (Contributor) commented

Thanks for the report; this is definitely something to fix. The query is triggering late materialization, and the v2 path is not coalescing I/O effectively here, which translates into far too many tiny reads.
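Roughly, late materialization means the scan first reads only the filter column and then goes back to fetch the remaining columns at just the matching row positions; those scattered fetches are the tiny reads mentioned above. A conceptual sketch using the public pylance API (not the internal v2 read path; the dataset path and projected columns are illustrative):

```python
import lance
import pyarrow.compute as pc

ds = lance.dataset("./lancedb/lineitem.lance")  # path is illustrative

# 1. Scan only the column used in the predicate.
ship = ds.to_table(columns=["l_shipmode"])
matching = pc.indices_nonzero(pc.equal(ship["l_shipmode"], "FOB"))

# 2. Fetch the remaining columns only at the matching row positions.
#    These scattered takes are the reads that need to be coalesced.
rows = ds.take(matching.to_pylist(), columns=["l_orderkey", "l_quantity"])
```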

raunaks13 added a commit that referenced this issue Jul 29, 2024
Should fix #2629; addresses #1959.
1. Previously, when randomly accessing rows, each row request was
scheduled separately, which increased overhead, especially on large
datasets. This PR coalesces take scheduling when requests are within
`block_size` distance of each other; the block size is determined
based on the system. (A sketch of this coalescing idea follows the
benchmark numbers below.)
2. The binary scheduler was also scheduling the decoding of all indices
individually. This PR updates the binary scheduler so that it schedules
all offsets at once; these are then processed to determine which bytes
to decode, as before.
3. A script we can use to compare v1 vs v2 performance is added as
`test_random_access.py`.

Specifically, on the lineitem dataset (same file from the issue above):
- v1 query time: `0.12s`
- v2 query time (before): `2.8s`
- v2 query time (after (1)): `0.54s`
- v2 query time (after (1) and (2)): `0.02s`
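As a rough standalone illustration of the coalescing described in point 1 above (not Lance's actual scheduler code): neighbouring requests are merged into a single read whenever the gap between them is within a block size.

```python
# Hypothetical sketch: group sorted offsets (byte offsets or row
# positions) into coalesced read ranges whenever neighbouring requests
# are within `block_size` of each other. Lance's real scheduler works
# on I/O requests inside the v2 file reader; this only shows the idea.

def coalesce(offsets, block_size=4096):
    ranges = []
    for off in sorted(offsets):
        if ranges and off - ranges[-1][1] <= block_size:
            # Close enough to the previous range: extend it.
            ranges[-1][1] = off
        else:
            # Too far away: start a new read range.
            ranges.append([off, off])
    return [tuple(r) for r in ranges]

# Example: five scattered requests collapse into two reads.
print(coalesce([0, 100, 200, 100_000, 100_050], block_size=4096))
# -> [(0, 200), (100000, 100050)]
```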