
Error optimizing large table #1419

Closed
cmackenzie1 opened this issue May 31, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@cmackenzie1
Contributor

cmackenzie1 commented May 31, 2023

Environment

Delta-rs version: 0.12

Binding: rust

Environment:

  • Cloud provider:
  • OS:
  • Other:

Bug

What happened:

Issuing an optimize on a large table failed with:

called `Result::unwrap()` on an `Err` value: Generic("Expected writer to return only one add action")
thread 'tests::test_optimize' panicked at 'called `Result::unwrap()` on an `Err` value: Generic("Expected writer to return only one add action")', src/main.rs:312:71

What you expected to happen:

Optimize to succeed

How to reproduce it:

Call DeltaOps::from(table).optimize().await.unwrap(); on a table where the writer buffer fills up while reading part of a source file (a minimal sketch follows below).

More details:

The table consists of 37,040 commits and ~75k files.
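For context, a minimal reproduction sketch written as an async test, assuming the deltalake Rust crate with the tokio runtime; the table URI is a placeholder for a real large table:

use deltalake::DeltaOps;

#[tokio::test]
async fn test_optimize() {
    // Placeholder URI: point this at a large table whose rewritten output
    // exceeds the optimize writer's target file size partway through a bin.
    let table = deltalake::open_table("s3://bucket/large-table").await.unwrap();

    // This is the call that panics with
    // Generic("Expected writer to return only one add action").
    DeltaOps::from(table).optimize().await.unwrap();
}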

@cmackenzie1 cmackenzie1 added the bug Something isn't working label May 31, 2023
@cmackenzie1
Contributor Author

if add_actions.len() != 1 {
    // Ensure we don't deviate from the merge plan which may result in idempotency being violated
    return Err(DeltaTableError::Generic(
        "Expected writer to return only one add action".to_owned(),
    ));
}

@cmackenzie1
Contributor Author

At first glance: if the writer buffer exceeds the target_file_size, it gets flushed to storage, which increases the number of Add actions and triggers this error.

Is the add_actions.len() != 1 check actually helping here, or is it not needed?
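A simplified, hypothetical model of what seems to be happening (not delta-rs code): a size-limited writer that flushes whenever its buffer crosses the target size, where each flushed file corresponds to one Add action.

// Toy model: each flush writes a new file, i.e. one more Add action.
struct ToyWriter {
    target_file_size: usize,
    buffered_bytes: usize,
    files_written: usize, // stand-in for the Add actions returned
}

impl ToyWriter {
    fn write(&mut self, batch_bytes: usize) {
        self.buffered_bytes += batch_bytes;
        if self.buffered_bytes >= self.target_file_size {
            // Buffer exceeded the target size: flush to storage mid-bin.
            self.files_written += 1;
            self.buffered_bytes = 0;
        }
    }

    fn close(&mut self) -> usize {
        if self.buffered_bytes > 0 {
            self.files_written += 1;
        }
        self.files_written
    }
}

If a single merge bin pushes the buffer past target_file_size even once, close() reports more than one file, and the add_actions.len() != 1 check above fires.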

@roeap
Collaborator

roeap commented May 31, 2023

Without having checked the code: in principle, the bin packing should bin files such that they yield a single resulting file. My guess is that either we have a bug in the binning logic, or the written file size is not as predictable as assumed.

Are we, for example, using the same compression during the initial write as during optimization?
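For reference, a rough sketch of the binning invariant being described (simplified first-fit packing; the real planner in delta-rs is more involved):

// Pack (path, size) pairs into bins whose total source size stays at or
// below target_file_size, so each bin is expected to rewrite into a single
// file and therefore a single Add action.
fn plan_bins(files: &[(String, u64)], target_file_size: u64) -> Vec<Vec<String>> {
    let mut bins: Vec<(u64, Vec<String>)> = Vec::new();
    for (path, size) in files {
        // First fit: reuse the first bin that still has room for this file.
        if let Some(idx) = bins
            .iter()
            .position(|(total, _)| total + size <= target_file_size)
        {
            bins[idx].0 += size;
            bins[idx].1.push(path.clone());
        } else {
            bins.push((*size, vec![path.clone()]));
        }
    }
    bins.into_iter().map(|(_, bin)| bin).collect()
}

The invariant only holds if the rewritten output of a bin is no larger than the sum of its inputs, which is exactly where compression differences can break the assumption.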

@wjones127
Collaborator

I think the problem is that we don't set compression by default when optimizing. Also, the written file size can be unpredictable.
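For illustration, pinning the Parquet codec with the parquet crate's WriterProperties; whether and how these properties get passed to the optimize writer depends on the fix discussed below, so treat that part as an assumption:

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// If optimize writes with a weaker (or no) codec than the source files used,
// a bin can rewrite to something larger than the sum of its inputs, spill
// past the target file size, and emit extra Add actions.
fn optimize_writer_properties() -> WriterProperties {
    WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build()
}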

@wjones127
Collaborator

I'll try to address this in my PR #1383.

@cmackenzie1
Contributor Author

@wjones127, were you able to address this in the aforementioned PR, or is it still outstanding?

@wjones127
Collaborator

It will be addressed in that PR.

@cmackenzie1
Contributor Author

Closing this as it seems to be resolved by #1383.
