Skip to content

Conversation

@JanKaul
Copy link
Owner

@JanKaul JanKaul commented Aug 5, 2025

This PR implements an alternative approach to solve #212.

It selects two manifest files, one for data and one for delete files. The aim is to limit the number of additional manifest files that have to be created.

Comment on lines +207 to +217
let latest_version = table
.object_store()
.get(&object_store::path::Path::from(format!(
"{table_dir}/metadata/version-hint.text"
)))
.await
.unwrap()
.bytes()
.await
.unwrap();
let latest_version_str = std::str::from_utf8(&latest_version).unwrap();
Copy link
Contributor

@gruuya gruuya Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bit unfortunate, since i realized that:

  • just loading the table won't give you the full path to latest metadata file
  • the version-hint.text (which does contain the full path) itself is not readable by duckdb
    Iceberg metadata file not found for table version '/Users/gruuya/EDB/store/warehouse/tpch/query12/. metadata/00000-3f569e94-5601-48f3-9199-8d71df4ea7b0.metadata.json' using 'none' compression and format(s): 'v%s%s.metadata.json,%s%s.metadata.json'
    

There are ways to customize how duckdb parses the version to read, but i think those work only when the version file is not detected in the first place. Either way it's a separate issue.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is the naming of the metadata.json wrong then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming is fine, it's just that version-hint.text should contain 00000-3f569e94-5601-48f3-9199-8d71df4ea7b0 (e.g. for a file named 00000-3f569e94-5601-48f3-9199-8d71df4ea7b0.metadata.json), so that it complies with the format %s%s.metadata.json (the other %s is for the compression codec.

I also know spark doesn't expect that format either

In [30]: df = spark.read.format("iceberg").load("/Users/gruuya/EDB/store/test/orders")
25/07/16 09:12:42 WARN HadoopTableOperations: Error reading version hint file /Users/gruuya/EDB/store/test/orders/metadata/version-hint.text
java.lang.NumberFormatException: For input string: "/Users/gruuya/EDB/store/test/orders/metadata/v2.metadata.json"

but I'm not sure exactly what's the proper format for it then.

Comment on lines +401 to +404
selected_data_manifest: Some(data_manifest),
selected_delete_manifest: delete_manifest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that there are both data and delete manifest entries selected. It's more explicit and better than in my pr where i only pick to continue in the one which has more new files to be added to.

})
.ok_or(Error::NotFound("Manifest for insert".to_owned()))
let (_, data_manifest) =
selected_data_state.ok_or(Error::NotFound("Manifest for insert".to_owned()))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this shouldn't be an error but instead, the entire Option<ManifestListEntry> should be returned, since the caller (ManifestListWriter::from_existing) knows how to handle that correctly.

Granted I think this scenario can only happen if we have an empty/new table and we wont to append some deletes to it, or some such edge case.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think think there is no valid scenario where there shouldn't be a data manifest selected here. And I think not having an Option in SelectedManifest is the most explicit way to express that. This way an Error has to be thrown, which in my opinion is the correct behavior. If you pass the Option to the caller, you rely on correct handling.

@gruuya
Copy link
Contributor

gruuya commented Aug 5, 2025

Btw those two failing mat-view tests are solved by this 0dc313a (otherwise you end up generating identical data and delete manifest paths).

@JanKaul
Copy link
Owner Author

JanKaul commented Aug 5, 2025

Btw those two failing mat-view tests are solved by this 0dc313a (otherwise you end up generating identical data and delete manifest paths).

Awesome, thanks! I was stuck on how to solve that one.

@JanKaul JanKaul marked this pull request as ready for review August 6, 2025 07:49
@gruuya
Copy link
Contributor

gruuya commented Aug 6, 2025

Fwiw, I've also been seeing the no-disk-space error during trino image pull.

I've done some diagnosing on it and it seems like /usr/local is inflated, and just by removing /usr/local/lib/android around 10GB could be reclaimed.

@JanKaul JanKaul force-pushed the fix-equalitly-deletes branch from 7310bf6 to 57a08c3 Compare August 6, 2025 12:13
@JanKaul JanKaul merged commit 85112a1 into main Aug 6, 2025
2 checks passed
@JanKaul JanKaul deleted the fix-equalitly-deletes branch August 6, 2025 13:57
JanKaul added a commit that referenced this pull request Aug 11, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Aug 12, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Aug 21, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Aug 27, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Aug 28, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Aug 28, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Sep 5, 2025
use separate manifest files for delete files
JanKaul added a commit that referenced this pull request Sep 16, 2025
use separate manifest files for delete files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants