-
Notifications
You must be signed in to change notification settings - Fork 38
use separate manifest files for delete files #228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let latest_version = table | ||
| .object_store() | ||
| .get(&object_store::path::Path::from(format!( | ||
| "{table_dir}/metadata/version-hint.text" | ||
| ))) | ||
| .await | ||
| .unwrap() | ||
| .bytes() | ||
| .await | ||
| .unwrap(); | ||
| let latest_version_str = std::str::from_utf8(&latest_version).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bit unfortunate, since i realized that:
- just loading the table won't give you the full path to latest metadata file
- the
version-hint.text(which does contain the full path) itself is not readable by duckdbIceberg metadata file not found for table version '/Users/gruuya/EDB/store/warehouse/tpch/query12/. metadata/00000-3f569e94-5601-48f3-9199-8d71df4ea7b0.metadata.json' using 'none' compression and format(s): 'v%s%s.metadata.json,%s%s.metadata.json'
There are ways to customize how duckdb parses the version to read, but i think those work only when the version file is not detected in the first place. Either way it's a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is the naming of the metadata.json wrong then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the naming is fine, it's just that version-hint.text should contain 00000-3f569e94-5601-48f3-9199-8d71df4ea7b0 (e.g. for a file named 00000-3f569e94-5601-48f3-9199-8d71df4ea7b0.metadata.json), so that it complies with the format %s%s.metadata.json (the other %s is for the compression codec.
I also know spark doesn't expect that format either
In [30]: df = spark.read.format("iceberg").load("/Users/gruuya/EDB/store/test/orders")
25/07/16 09:12:42 WARN HadoopTableOperations: Error reading version hint file /Users/gruuya/EDB/store/test/orders/metadata/version-hint.text
java.lang.NumberFormatException: For input string: "/Users/gruuya/EDB/store/test/orders/metadata/v2.metadata.json"but I'm not sure exactly what's the proper format for it then.
| selected_data_manifest: Some(data_manifest), | ||
| selected_delete_manifest: delete_manifest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that there are both data and delete manifest entries selected. It's more explicit and better than in my pr where i only pick to continue in the one which has more new files to be added to.
| }) | ||
| .ok_or(Error::NotFound("Manifest for insert".to_owned())) | ||
| let (_, data_manifest) = | ||
| selected_data_state.ok_or(Error::NotFound("Manifest for insert".to_owned()))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this shouldn't be an error but instead, the entire Option<ManifestListEntry> should be returned, since the caller (ManifestListWriter::from_existing) knows how to handle that correctly.
Granted I think this scenario can only happen if we have an empty/new table and we wont to append some deletes to it, or some such edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think think there is no valid scenario where there shouldn't be a data manifest selected here. And I think not having an Option in SelectedManifest is the most explicit way to express that. This way an Error has to be thrown, which in my opinion is the correct behavior. If you pass the Option to the caller, you rely on correct handling.
|
Btw those two failing mat-view tests are solved by this 0dc313a (otherwise you end up generating identical data and delete manifest paths). |
Awesome, thanks! I was stuck on how to solve that one. |
|
Fwiw, I've also been seeing the no-disk-space error during trino image pull. I've done some diagnosing on it and it seems like |
provide more context for unimplemented macro
7310bf6 to
57a08c3
Compare
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
use separate manifest files for delete files
This PR implements an alternative approach to solve #212.
It selects two manifest files, one for data and one for delete files. The aim is to limit the number of additional manifest files that have to be created.