Fix checkpoint cleanup failure (#688) #689

Merged
merged 1 commit into from
Jul 17, 2024

mwylde (Member) commented Jul 17, 2024

Fixes a failure that could occur during checkpoint cleanup when a table exists in one epoch but not in a previous one. To clean up a checkpoint, we use the following procedure:

  1. Get the metadata for the "new min" epoch (the oldest one that won't be cleaned) and collect all of the files that it references
  2. For each epoch that we are cleaning, get its metadata and collect all of the files that it references
  3. Delete every file from (2) that is not in (1)
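
The three steps above amount to a set difference. A minimal sketch, assuming a simplified metadata model in which each epoch maps table names to the files they reference; `EpochMetadata`, `referenced_files`, and `files_to_delete` are hypothetical names for illustration, not the project's actual types or functions:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for an epoch's checkpoint metadata:
/// table name -> files referenced by that table.
type EpochMetadata = BTreeMap<String, Vec<String>>;

/// Collect every file referenced by any table in one epoch.
fn referenced_files(meta: &EpochMetadata) -> BTreeSet<String> {
    meta.values().flatten().cloned().collect()
}

/// Steps 1-3: files referenced by the epochs being cleaned,
/// minus the files still referenced by the "new min" epoch.
fn files_to_delete(new_min: &EpochMetadata, cleaned: &[EpochMetadata]) -> BTreeSet<String> {
    // Step 1: files that must survive.
    let keep = referenced_files(new_min);
    // Steps 2 and 3: gather files from each cleaned epoch, drop the ones to keep.
    cleaned
        .iter()
        .flat_map(referenced_files)
        .filter(|file| !keep.contains(file))
        .collect()
}
```

In this sketch, a file that is still referenced by the new min epoch is never returned, even if older epochs also reference it.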

To determine which files are referenced, we have to look at the table metadata to figure out the table type and config. That involves looping through all of the tables referenced in a particular operator checkpoint. However, there was a subtle bug: we were using the metadata from (1) for each iteration of (2). As a result, if a table existed only in the new_min epoch but not in the earlier checkpoints, we would fail to find it in the older epoch and panic.

The fix is to ensure we are always iterating over the tables of the epoch that we're cleaning.
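
The difference between the two behaviors can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `TableCheckpoint`, `OperatorCheckpoint`, `files_buggy`, and `files_fixed` are hypothetical names, and the buggy variant returns an `Err` where the real code panicked so that the failure is easy to observe.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-table checkpoint metadata.
#[derive(Clone, Debug)]
struct TableCheckpoint {
    files: Vec<String>,
}

/// Hypothetical operator checkpoint: table name -> table metadata.
type OperatorCheckpoint = BTreeMap<String, TableCheckpoint>;

/// Buggy shape: the table list comes from the new_min epoch,
/// but the lookup is made against the epoch being cleaned.
fn files_buggy(
    new_min: &OperatorCheckpoint,
    epoch: &OperatorCheckpoint,
) -> Result<BTreeSet<String>, String> {
    let mut files = BTreeSet::new();
    for table in new_min.keys() {
        // Fails when the table exists only in new_min.
        let meta = epoch
            .get(table)
            .ok_or_else(|| format!("table {} missing from epoch being cleaned", table))?;
        files.extend(meta.files.iter().cloned());
    }
    Ok(files)
}

/// Fixed shape: iterate over the tables of the epoch being cleaned.
fn files_fixed(epoch: &OperatorCheckpoint) -> BTreeSet<String> {
    epoch
        .values()
        .flat_map(|meta| meta.files.iter().cloned())
        .collect()
}
```

With an older epoch that has only table `a` and a new_min epoch that has tables `a` and `b`, `files_buggy` returns an error for `b`, while `files_fixed` returns just the files of `a`.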

Closes #688

mwylde enabled auto-merge (squash) July 17, 2024 22:43
mwylde merged commit 3f2e94b into master Jul 17, 2024
6 checks passed
mwylde added a commit that referenced this pull request Jul 30, 2024
Development

Successfully merging this pull request may close these issues.

Reoccurring errors with checkpointing