feat: allow multiple incremental commits in optimize #1621

Merged 1 commit on Sep 19, 2023

kvap (Contributor) commented Sep 9, 2023

Currently "optimize" executes the whole plan in one commit, which might fail. The larger the table, the more likely it is to fail and the more expensive the failure is.

Add an option in OptimizeBuilder that allows specifying a commit interval. If that is provided, the plan executor will periodically commit the accumulated actions.
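
A minimal sketch of the periodic-commit idea described above, not the code from this PR. `FinishedTask` and `plan_commits` are hypothetical names, and elapsed time is simulated per task instead of being read from a clock, so the grouping is deterministic.

```rust
use std::time::Duration;

/// One finished rewrite task in this simulation: the actions it produced and
/// how long it took. Both fields are hypothetical stand-ins.
struct FinishedTask {
    actions: Vec<String>,
    elapsed: Duration,
}

/// Groups actions into commits. With `None`, everything lands in one final
/// commit; with `Some(interval)`, pending actions are flushed whenever more
/// than `interval` has passed since the previous commit.
fn plan_commits(
    tasks: Vec<FinishedTask>,
    min_commit_interval: Option<Duration>,
) -> Vec<Vec<String>> {
    let mut commits = Vec::new();
    let mut pending: Vec<String> = Vec::new();
    let mut since_last_commit = Duration::ZERO;
    let total = tasks.len();

    for (idx, task) in tasks.into_iter().enumerate() {
        pending.extend(task.actions);
        since_last_commit += task.elapsed;

        // The last task always triggers a commit of whatever is pending.
        let end = idx + 1 == total;
        let interval_elapsed =
            min_commit_interval.map_or(false, |i| since_last_commit > i);

        if !pending.is_empty() && (interval_elapsed || end) {
            // Hand the accumulated actions to a commit and start a new batch.
            commits.push(std::mem::take(&mut pending));
            since_last_commit = Duration::ZERO;
        }
    }
    commits
}
```

With three tasks of 40 seconds each and a 60-second interval, this sketch produces two commits (the first two tasks, then the third). With no interval it produces one.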

github-actions bot commented Sep 9, 2023

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@kvap kvap changed the title Allow multiple incremental commits in optimize feat: allow multiple incremental commits in optimize Sep 9, 2023
@kvap kvap force-pushed the optimize-in-multiple-commits branch from 66c5045 to acca200 Compare September 9, 2023 07:09
let now = Instant::now();
if !actions.is_empty() && (self.min_commit_interval.map_or(false, |i| now.duration_since(last_commit) > i) || end) {
let actions = std::mem::take(&mut actions);
last_commit = now;
Contributor

Does it matter if this is updated before the commit succeeds?

Contributor Author

I don't think it matters much, because min_commit_interval is supposed to be much larger than the time it takes to do a commit.
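
To put rough numbers on that answer, here is a small sketch of the timing difference between the two orderings. The function `next_commit_due` and the example durations are illustrative assumptions, not part of the PR.

```rust
use std::time::Duration;

/// Earliest offset, measured from the start of the previous commit, at which
/// the interval check can pass again.
fn next_commit_due(
    interval: Duration,
    commit_duration: Duration,
    stamp_before_commit: bool,
) -> Duration {
    if stamp_before_commit {
        // `last_commit` is set before committing, so the interval is counted
        // from the moment the commit starts.
        interval
    } else {
        // `last_commit` is set after success, so the interval is counted
        // from the moment the commit ends.
        interval + commit_duration
    }
}
```

Assuming a 10-minute interval and a 2-second commit, the two orderings differ by only those 2 seconds (600 s versus 602 s), which is consistent with the reply that the ordering has little practical effect.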

kvap (Contributor Author) commented Sep 11, 2023

clippy is complaining that create_merge_plan has too many arguments. After looking at that function, and at the MergePlan struct, I noticed that they take and store some things which, rather than describing the plan itself, describe how to execute it: max_concurrent_tasks, max_spill_size, and min_commit_interval.

Should we move those out of MergePlan and create_merge_plan, and add them as arguments to MergePlan.execute()?
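
A rough sketch of the proposed split, under heavy simplification: the plan holds only what to merge, and the execution settings arrive as arguments to `execute`. The `bins` field and the string return value are hypothetical, and the object store and table state arguments that the real `execute` call takes are left out.

```rust
use std::time::Duration;

/// What to do: a simplified, hypothetical plan that only lists the groups of
/// files to merge.
struct MergePlan {
    bins: Vec<Vec<String>>,
}

impl MergePlan {
    /// How to do it: the execution settings are passed in here and are not
    /// stored on the plan. This sketch returns a description and rewrites
    /// no files.
    fn execute(
        &self,
        max_concurrent_tasks: usize,
        max_spill_size: usize,
        min_commit_interval: Option<Duration>,
    ) -> String {
        format!(
            "{} bins, {} tasks at once, spill limit {}, commit interval {:?}",
            self.bins.len(),
            max_concurrent_tasks,
            max_spill_size,
            min_commit_interval
        )
    }
}
```

One consequence of this layout is that the same plan value can be executed with different settings without being rebuilt.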

@kvap kvap force-pushed the optimize-in-multiple-commits branch 2 times, most recently from 39a12bb to 236c86a Compare September 11, 2023 11:04
wjones127 (Collaborator) left a comment

Overall approach seems reasonable so far.

rust/src/operations/optimize.rs (resolved)
wjones127 (Collaborator) commented

> Should we move those out of MergePlan and create_merge_plan, and add them as arguments to MergePlan.execute()?

Yeah that sounds like a good idea 👍

@kvap kvap force-pushed the optimize-in-multiple-commits branch from 236c86a to 4c09037 Compare September 15, 2023 07:51
kvap (Contributor Author) commented Sep 15, 2023

> > Should we move those out of MergePlan and create_merge_plan, and add them as arguments to MergePlan.execute()?
>
> Yeah that sounds like a good idea 👍

Done.

ion-elgreco (Collaborator) commented

@kvap is this option also exposed to the Python bindings?

wjones127 (Collaborator) left a comment

I'd like to see one more change to the tests before merging.

@ion-elgreco I've created a follow up issue for the Python bindings: #1640

Comment on lines +587 to +590
max_concurrent_tasks: usize,
#[allow(unused_variables)] // used behind a feature flag
max_spill_size: usize,
min_commit_interval: Option<Duration>,
Collaborator

This is fine for now, but eventually we might want to put these execution settings in a struct.
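
One possible shape for such a struct, as a sketch only. The name `ExecutionSettings`, the builder-style method, and the default values are assumptions. The values 1 and 20 are borrowed from the test call shown in this review and are not claimed to be the crate's real defaults.

```rust
use std::time::Duration;

/// Hypothetical bundle of the three execution settings discussed above.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionSettings {
    max_concurrent_tasks: usize,
    max_spill_size: usize,
    min_commit_interval: Option<Duration>,
}

impl Default for ExecutionSettings {
    fn default() -> Self {
        // Placeholder values for illustration only.
        Self {
            max_concurrent_tasks: 1,
            max_spill_size: 20,
            min_commit_interval: None,
        }
    }
}

impl ExecutionSettings {
    /// Builder-style setter, so callers override only what they need.
    fn with_min_commit_interval(mut self, interval: Duration) -> Self {
        self.min_commit_interval = Some(interval);
        self
    }
}
```

A call site could then pass `ExecutionSettings::default().with_min_commit_interval(Duration::from_secs(600))` as a single argument instead of three positional ones.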

Comment on lines +308 to +309
let maybe_metrics = plan
.execute(dt.object_store(), &dt.state, 1, 20, None)
.await;
Collaborator

Could you add at least one test where we pass in a commit interval? Even if it doesn't make an intermediate commit, it would be good to know those code paths are exercised.

Contributor Author

Added a test where an intermediate commit is actually created.

@kvap kvap force-pushed the optimize-in-multiple-commits branch from 4c09037 to 15892d9 Compare September 18, 2023 13:44
@rtyler rtyler added this to the Rust v0.16 milestone Sep 19, 2023
@rtyler rtyler merged commit fae39b1 into delta-io:main Sep 19, 2023
21 checks passed
@kvap kvap deleted the optimize-in-multiple-commits branch September 26, 2023 10:00
Labels: binding/rust (Issues for the Rust crate), rust
5 participants