Skip to content

Conversation

@gruuya
Copy link
Contributor

@gruuya gruuya commented Aug 4, 2025

  • make sure manifest list entries are exclusively for data or delete content, and never mixed, since that will confuse other readers
  • when executing the append operation try to find and re-use an existing manifest list entry with content that has the most files to add, instead of always appending to data entries
  • extend the equality delete integration test, and verify DuckDB can read it correctly as well

Closes #212.


manifest.manifest_path =
new_manifest_location(&self.table_metadata.location, &self.commit_uuid, 0);
new_manifest_location(&self.table_metadata.location, &self.commit_uuid, content as usize);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so that data and delete manifest path is different, otherwise if both are present the deletes will overwrite the data (as they're applied second).

This also needs fixing in the append_multiple_filtered, which could be merged with append_filtered; need to think about it some more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i solved this by having a per-instance manifest counter, which then helps to determine the next manifest id to use in the file name.

@gruuya gruuya force-pushed the separate-del-manifest-list-entry branch 6 times, most recently from 19ef3a3 to 373036a Compare August 5, 2025 09:15
@JanKaul
Copy link
Owner

JanKaul commented Aug 5, 2025

Thanks a lot for the PR. I was also thinking about how to fix this. If I understand the proposal correctly, you create an additional manifest for the delete files for every commit, right?

I'm working on a PR that selects two manifests, one for data and one for delete files. Once I'm finished, we can have look at which approach we prefer. This would reduce the number of manifests being written.

PR: #228

@gruuya gruuya force-pushed the separate-del-manifest-list-entry branch from 373036a to 3f820e7 Compare August 5, 2025 09:28
@gruuya
Copy link
Contributor Author

gruuya commented Aug 5, 2025

If I understand the proposal correctly, you create an additional manifest for the delete files for every commit, right?

Yes, that's basically it. It ensures all delete files are separated into new manifest list entry/entries, which are solely for deletes.

I'm working on a PR that selects two manifests, one for data and one for delete files.

Sounds great! I can take a look at the (draft) PR, but i suppose anything that ensures the above will fix this. The test whether it works is in this PR as well, it's the change in test_equality_delete which uses duckdb to ensure it also agrees on the contents of the latest snapshot.

@JanKaul
Copy link
Owner

JanKaul commented Aug 5, 2025

Yeah I saw your test. It's great, thanks!

@gruuya
Copy link
Contributor Author

gruuya commented Aug 6, 2025

Closing in favor of #228.

@gruuya gruuya closed this Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect manifest list entries for (equality) delete files

2 participants