KV drivers scan too many files with small "subdirectory" listings #4734

arielshaqed · 2022-12-05T06:42:48Z

The Scan operation of the kv.Store interface supports only a start key,
not a prefix. The KV code performs a prefix scan by requesting everything
after the prefix and then filtering the results.

Add a prefix parameter to Scan, making it more like the relevant lakeFS
and S3 API calls.

Relevant for #4521: LakeFSOutputCommitter performs a listObjects API call
for every task. Because of this interface limitation, KV Graveler requests
a scan-after rather than a prefix scan. In all Spark cases seen, this call
returns a single object. So instead of processing a single object on the
store, DynamoDB or PostgreSQL processes the maximal number of objects (300
on DynamoDB, 1000 on PostgreSQL) and returns them to the lakeFS server,
which decodes and filters them. This occurs once per partition; we are
testing with 4000 partitions, and this may explain some of the time
difference.
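
As a rough back-of-the-envelope check of the numbers above, the sketch below multiplies the partition count by the batch size. It assumes one listing per partition and that every scan-after returns a full batch, so the results are upper bounds; `entriesFetched` is a hypothetical helper written for this estimate.

```go
package main

import "fmt"

// entriesFetched estimates how many entries leave the store, assuming each
// listing triggers exactly one scan that returns perScan entries.
func entriesFetched(listings, perScan int) int {
	return listings * perScan
}

func main() {
	const partitions = 4000 // partition count from the test described above

	fmt.Println("DynamoDB scan-after:  ", entriesFetched(partitions, 300))
	fmt.Println("PostgreSQL scan-after:", entriesFetched(partitions, 1000))
	fmt.Println("prefix scan:          ", entriesFetched(partitions, 1))
}
```

Under these assumptions the scan-after approach moves up to 1,200,000 entries on DynamoDB or 4,000,000 on PostgreSQL, compared with 4,000 for a prefix scan that returns the single matching object per partition.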

arielshaqed · 2022-12-12T11:21:55Z

Posted a draft PR for work I did on this before we abandoned LakeFSOutputCommitter. However, this speedup is substantial and highly relevant to LakeFSFS performance, especially when used with the default Spark FileOutputCommitter.

arielshaqed · 2023-02-22T13:21:46Z

@guy-har we should close this, no?

arielshaqed self-assigned this Dec 5, 2022

talSofer added the team/ecosystem Team Ecosystem label Dec 5, 2022

talSofer added this to the LakeFSOutputCommitter 1 milestone Dec 5, 2022

arielshaqed linked a pull request Dec 12, 2022 that will close this issue

Add prefix scan to KV implementations #4792

Closed

arielshaqed mentioned this issue Dec 12, 2022

Perform fewer API calls for exists #4797

Merged

arielshaqed removed the team/ecosystem Team Ecosystem label Dec 19, 2022

arielshaqed removed their assignment Dec 19, 2022

arielshaqed removed this from the LakeFSOutputCommitter 1 milestone Dec 19, 2022

guy-har added team/cloud-native Team cloud native team/versioning-engine Team versioning engine labels Dec 21, 2022

ortz assigned ortz and guy-har and unassigned ortz Dec 28, 2022

guy-har closed this as completed Aug 8, 2023
