lancedb version 0.9.0, using Lance v2: the performance of the new file format is slower than the old format.

dataset:
parquet: https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
parquet to lance:

import lance
import pyarrow.parquet as pq

parquet = pq.read_table("lineitemsf1.snappy.parquet")
lance_path = "lineitem.lance"  # destination path (illustrative)
lance.write_dataset(parquet, lance_path, use_legacy_format=False)

case:

import lancedb

db = lancedb.connect("/path/to/db")  # connection shown for completeness
table = db.open_table("lineitem")
res = table.search().where("l_shipmode = 'FOB'").limit(1000000000).to_pandas()

Latency of old format: (screenshot in the original report)
Latency of new format: (screenshot in the original report)
Thanks for the report; this is definitely something to fix. This query triggers late materialization, and the v2 path is not coalescing I/O effectively here, so it turns into far too many tiny reads.
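For context, here is a minimal sketch of the access pattern being described. The helper names (`read_column`, `take_row`) are hypothetical and only illustrate the two-phase read, not Lance's actual scheduler:

```python
# Hypothetical illustration of late materialization, not Lance's real code.
def late_materialized_filter(dataset, filter_col, predicate, other_cols):
    # Phase 1: scan only the narrow filter column (cheap, sequential).
    matches = [
        i for i, value in enumerate(dataset.read_column(filter_col))
        if predicate(value)
    ]
    # Phase 2: fetch just the matching rows of the remaining columns.
    # If nothing coalesces these point lookups, every scattered row
    # index can become its own tiny read against storage.
    return [dataset.take_row(i, other_cols) for i in matches]
```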
Should fix #2629; addresses #1959.
1. Previously, when randomly accessing rows, each row request was
scheduled separately, which added overhead, especially on large
datasets. This PR coalesces take scheduling when requests are within
`block_size` distance of each other; the block size is determined
based on the system. (A sketch of this coalescing follows the list.)
2. The binary scheduler was also scheduling the decoding of each index
individually. This updates the binary scheduler so that it schedules all
offsets at once; these are then processed to determine which bytes to
decode, as before. (See the second sketch below.)
3. A script for comparing v1 vs. v2 random-access performance is added as
`test_random_access.py`. (A sketch of its shape follows the timings below.)
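A rough sketch of the coalescing in (1); here `block_size` is a plain parameter, whereas the PR derives it from the system, and the function is illustrative rather than the real scheduler code:

```python
# Sketch of coalescing nearby take requests into shared reads.
# Assumes `offsets` is sorted and non-empty.
def coalesce_takes(offsets, block_size):
    groups = [[offsets[0]]]
    for off in offsets[1:]:
        if off - groups[-1][-1] <= block_size:
            groups[-1].append(off)  # within block_size: same I/O request
        else:
            groups.append([off])    # too far away: schedule a new read
    return groups

# Three requested rows collapse into two reads instead of three:
print(coalesce_takes([10, 100, 5000], block_size=4096))
# [[10, 100], [5000]]
```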
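And a sketch of the batching in (2), assuming the usual binary-column layout of an offsets array plus a contiguous data buffer (names are illustrative):

```python
# Sketch of scheduling all offsets at once for a binary column.
# `offsets` has one more entry than the column has values.
def schedule_binary_take(offsets, row_indices):
    # Old behavior: one scheduling step per requested index.
    # New behavior: gather every (start, end) pair in a single pass,
    # then hand the resulting byte ranges to the decoder as before.
    return [(offsets[i], offsets[i + 1]) for i in row_indices]

offsets = [0, 3, 3, 9, 14]                    # 4 values with lengths 3, 0, 6, 5
print(schedule_binary_take(offsets, [0, 2]))  # [(0, 3), (3, 9)]
```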
Specifically, on the lineitem dataset (the same file as in the issue above):
- v1 query time: `0.12s`
- v2 query time (before): `2.8s`
- v2 query time (after (1)): `0.54s`
- v2 query time (after (1) and (2)): `0.02s`
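The shape of such a comparison, with placeholder dataset paths (the actual `test_random_access.py` ships with this PR):

```python
import time

import lance

def time_query(path):
    ds = lance.dataset(path)
    start = time.perf_counter()
    ds.to_table(filter="l_shipmode = 'FOB'")  # same query as the issue
    return time.perf_counter() - start

# Paths are placeholders for a v1- and a v2-format copy of lineitem.
print("v1:", time_query("lineitem_v1.lance"))
print("v2:", time_query("lineitem_v2.lance"))
```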