The kv.Store interface Scan operation has no prefix support, only a start key. Prefix scans are performed by the KV code requesting everything after the prefix and then filtering.

Add a prefix parameter to Scan, making it more like the relevant lakeFS and S3 API calls.
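
A minimal sketch of what the extended interface might look like. The names here (`ScanOptions`, `EntriesIterator`, `Entry`, and the `Prefix` field) are illustrative assumptions, not the actual lakeFS kv package API:

```go
package kv

import "context"

// EntriesIterator iterates over scan results (minimal sketch).
type EntriesIterator interface {
	Next() bool
	Entry() *Entry
	Err() error
	Close()
}

// Entry is a single key/value pair in a partition (sketch).
type Entry struct {
	Key   []byte
	Value []byte
}

type ScanOptions struct {
	// KeyStart is the first key to return (the existing "start" semantics).
	KeyStart []byte
	// Prefix, when set, restricts the scan to keys beginning with this
	// byte prefix, so the store filters server-side instead of returning
	// a full page for lakeFS to filter client-side.
	Prefix []byte
}

type Store interface {
	// Scan returns entries of partitionKey in ascending key order,
	// starting at KeyStart and restricted to Prefix when set.
	Scan(ctx context.Context, partitionKey []byte, options ScanOptions) (EntriesIterator, error)
}
```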
Relevant for #4521: LakeFSOutputCommitter performs a listObjects API call for every task. Because of this interface limitation, KV Graveler requests a scan-after rather than a prefix scan. In all Spark cases seen, this call returns a single object, yet the store still processes and returns a full batch: instead of a single object handled on the store, the maximum number of objects is scanned on DynamoDB or PostgreSQL, returned to the lakeFS server, decoded, and filtered there -- 300 objects on DynamoDB, 1000 on PostgreSQL. And this occurs once per partition; we are testing with 4000 partitions, which may explain some of the time difference.
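
For illustration, a hedged sketch of the scan-after-then-filter pattern described above, built on the interface sketch; the `scanPrefix` helper is hypothetical, not actual lakeFS code:

```go
package kv

import (
	"bytes"
	"context"
)

// scanPrefix shows today's pattern: ask the store for everything from
// the prefix onward, then filter on the lakeFS side.
func scanPrefix(ctx context.Context, store Store, partitionKey, prefix []byte) ([]Entry, error) {
	// Scan-after: the store has no idea we only want keys under prefix,
	// so its driver fetches and returns a full page regardless (e.g.
	// 300 items on DynamoDB, 1000 on PostgreSQL).
	it, err := store.Scan(ctx, partitionKey, ScanOptions{KeyStart: prefix})
	if err != nil {
		return nil, err
	}
	defer it.Close()

	var matches []Entry
	for it.Next() {
		e := it.Entry()
		// Filtering happens here, after the page was already fetched,
		// returned, and decoded. Keys arrive in ascending order, so the
		// first mismatch ends the prefix range.
		if !bytes.HasPrefix(e.Key, prefix) {
			break
		}
		matches = append(matches, *e)
	}
	return matches, it.Err()
}
```

With a Prefix field on ScanOptions, the same range check would run inside the store query itself, and a one-object result would cost roughly one object's worth of work.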
Posted a draft PR for the work I did on this before we abandoned LakeFSOutputCommitter. However, this speedup is substantial and highly relevant to LakeFSFS performance, especially when used with the default Spark FileOutputCommitter.