KV drivers scan too many files with small "subdirectory" listings #4734

Closed
arielshaqed opened this issue Dec 5, 2022 · 2 comments
Labels: team/cloud-native, team/versioning-engine

Comments

@arielshaqed
Contributor

The Scan operation of the kv.Store interface takes only a start key; it
has no prefix support. To scan a prefix, the KV code requests everything
from the prefix onward and then filters the results itself.

Add a prefix parameter to Scan, making it more like the corresponding
lakeFS and S3 API calls.

Relevant for #4521: LakeFSOutputCommitter performs a listObjects API call
for every task. Because of this interface limitation, KV Graveler requests
a scan-after rather than a prefix scan. In all Spark cases seen, this call
returns a single object. So instead of a single object being processed on
the store, the maximal number of objects per scan (300 on DynamoDB, 1000
on PostgreSQL) is read on the store, returned to the lakeFS server,
decoded, and filtered there. This occurs once per partition; we are
testing with 4000 partitions, and this may explain some of the time
difference.

@arielshaqed arielshaqed self-assigned this Dec 5, 2022
@talSofer talSofer added the team/ecosystem Team Ecosystem label Dec 5, 2022
@talSofer talSofer added this to the LakeFSOutputCommitter 1 milestone Dec 5, 2022
@arielshaqed arielshaqed linked a pull request Dec 12, 2022 that will close this issue
@arielshaqed
Contributor Author

Posted a draft PR for work I did on this before we abandoned LakeFSOutputCommitter. However, the speedup is substantial and highly relevant to LakeFSFS performance, especially when used with the default Spark FileOutputCommitter.

@arielshaqed arielshaqed removed the team/ecosystem Team Ecosystem label Dec 19, 2022
@arielshaqed arielshaqed removed their assignment Dec 19, 2022
@arielshaqed arielshaqed removed this from the LakeFSOutputCommitter 1 milestone Dec 19, 2022
@guy-har guy-har added team/cloud-native Team cloud native team/versioning-engine Team versioning engine labels Dec 21, 2022
@ortz ortz assigned ortz and guy-har and unassigned ortz Dec 28, 2022
@arielshaqed
Contributor Author

@guy-har we should close this, no?

@guy-har guy-har closed this as completed Aug 8, 2023

4 participants