Fix checkpoint cleanup failure (#688) #689
Merged
Fixes a failure that could occur in checkpoint cleanup when a table exists in one epoch but not in a previous epoch. To clean up a checkpoint, we follow this procedure:
To determine which files are referenced, we have to look at the table metadata to figure out the table type and config. That involves looping through all of the tables referenced in a particular operator checkpoint. However, there was a subtle bug: we were using the metadata from (1) for each iteration of (2). That meant that if a table existed only in the new_min epoch but not in the previous checkpoints, we would fail to find it in the older one and panic.
The fix is to ensure we are always iterating over the tables of the epoch that we're cleaning.
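As a rough illustration of the fix, here is a minimal sketch, not the code from this PR. The `TableFiles` type and the `files_to_delete` function are hypothetical names, and the sketch assumes that per-epoch metadata can be modeled as a map from table name to the set of files that table references. It also assumes that a table missing from new_min simply has no live files.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for per-epoch operator checkpoint metadata:
/// table name -> files referenced by that table in that epoch.
type TableFiles = BTreeMap<String, BTreeSet<String>>;

/// Returns the files of the epoch being cleaned (`old`) that are no longer
/// referenced by the `new_min` epoch and can therefore be deleted.
fn files_to_delete(old: &TableFiles, new_min: &TableFiles) -> BTreeSet<String> {
    let mut deletable = BTreeSet::new();
    // Iterate over the tables of the epoch we are cleaning. Iterating over
    // new_min's tables instead would require looking up tables in `old`
    // that may not exist there, which is the failure described above.
    for (table, old_files) in old {
        // Assumption: a table absent from new_min keeps none of its files alive.
        let live = new_min.get(table);
        for file in old_files {
            let still_referenced = live.map_or(false, |files| files.contains(file));
            if !still_referenced {
                deletable.insert(file.clone());
            }
        }
    }
    deletable
}
```

With this shape, a table that exists only in new_min is never looked up in the older epoch, so there is no missing entry to fail on.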
Closes #688