-
Notifications
You must be signed in to change notification settings - Fork 532
feat: constant-time manifest lookup on object stores #2798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: constant-time manifest lookup on object stores #2798
Conversation
Benchmark resultsBenchmark scripts
Running benchmark: import lance
import pyarrow as pa
import pyarrow.fs as fs
import timeit
def bench_manifest_paths(use_v2, uri):
# print("doing setup")
# data = pa.table({'a': range(1)})
# dataset = lance.write_dataset(data, uri, enable_v2_manifest_paths=use_v2)
# for i in range(10_000):
# dataset.delete("false")
# print("dataset now is on version: {}".format(dataset.version))
s3, path = fs.FileSystem.from_uri(uri)
infos = s3.get_file_info(fs.FileSelector(path + '/_versions', recursive=True))
print("number of versions: {}".format(len(infos)))
iters = 20
total_time = timeit.timeit(lambda: lance.dataset(uri), number=iters)
print(total_time / iters)
uri = "s3://lance-performance-testing/test_v2_manifests"
# uri = "test_ds"
# aws s3 cp --recursive ./test_ds s3://lance-performance-testing/test_v2_manifests
# aws s3 cp --recursive ./test_ds_v1 s3://lance-performance-testing/test_v2_manifests_v1
bench_manifest_paths(True, uri)
bench_manifest_paths(False, uri + "_v1")Reducing number of files: import pyarrow.fs as fs
import argparse
import boto3
uris = [
"s3://lance-performance-testing/test_v2_manifests",
"s3://lance-performance-testing/test_v2_manifests_v1"
]
for uri in uris:
s3, path = fs.FileSystem.from_uri(uri)
infos = s3.get_file_info(fs.FileSelector(path + '/_versions', recursive=True))
parser = argparse.ArgumentParser()
parser.add_argument('target_files', type=int)
args = parser.parse_args()
to_delete = len(infos) - args.target_files
print(to_delete)
s3_client = boto3.client('s3')
objects = [{'Key': info.path.split("/", maxsplit=1)[1]} for info in infos[:to_delete]]
print(objects[:10])
s3_client.delete_objects(Bucket='lance-performance-testing', Delete={'Objects': objects}) |
2a551c3 to
9d23023
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2798 +/- ##
========================================
Coverage 77.94% 77.95%
========================================
Files 229 229
Lines 70147 70539 +392
Branches 70147 70539 +392
========================================
+ Hits 54679 54987 +308
- Misses 12393 12466 +73
- Partials 3075 3086 +11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks well thought out. Does this scheme work on local filesystems? For some reason I thought it did not. I ask because it is not clear from the comments that the V2 scheme should only be used on local filesystems.
| let version = scheme | ||
| .parse_version(meta.location.filename().unwrap()) | ||
| .unwrap(); | ||
| if version > current_version { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this must be true according to our understanding of object stores yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10.manifest will be before 8.manifest, so not necessarily. It would be if we had zero-padded the numbers.
| } | ||
| }); | ||
|
|
||
| let first = valid_manifests.next().await.transpose()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the case of v2 we only look at the first result. Is there value in some kind of debug_assert that looks at the remaining asserts and ensures our understanding of object stores is correct? E.g. what happens if some new object store (e.g. r2, digital ocean, etc.) decides to use a different convention? Would we catch it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a sanity check that looks at the first 1k results to see if they are ordered. This is the page size in object store list operations. I've verified there is no latency impact on S3.
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_v2_manifest_path_create() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want a test creating a v2 manifest using commit instead of write?
It doesn't make much of a difference on local filesystems. Local filesystems have to use a special code branch that makes them list the entire directory every time, instead of relying upon the lexical ordering. |
b02dbe1 to
9ba0fca
Compare
The new V2 manifest path scheme makes discovering the latest version of a table constant time on object stores, regardless of the number of versions in the table. See benchmarks in the PR here: lance-format/lance#2798 Closes #1583
BREAKING CHANGE: defaults new datasets to use V2 manifest path naming scheme. This makes these datasets unreadable for versions of Lance library prior to v0.17.0 (released September 2024). This default improves performance on object storage. See the original PR (lance-format#2798) for details. Close lance-format#5634

This introduces
ManifestNamingSchemewith aV1andV2variant.V1is the existing naming scheme.V2uses a scheme optimized for object storage listing mechanisms, making looking up the latest manifest constant time.On S3, this makes
lance.dataset()take only 125 ms, regardless of how many versions of the dataset existed. Previously this time grew linearly with number of versions.We also provide a method
migrate_manifest_paths_v2. Because this method alters the manifest path scheme, and agreement on the scheme is critical for the write path, it's important is it not run while any read or write operations happen on that table. That's why this migration can't happen in the background like some other migrations.Closes #2790