Fix checkpoint cleanup failure (#688) #689

mwylde · 2024-07-17T22:40:52Z

Fixes a failure that could occur in checkpoint cleanup when a table exists in one epoch but not in a previous epoch. To clean up a checkpoint, we use the following procedure:

1. Get the metadata for the "new min" epoch (the oldest one that won't be cleaned) and look at all of the files that it references
2. For each epoch that we are cleaning, get its metadata and look at all of the files that it references
3. For every file in (2) that's not in (1), delete it
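The last step of the procedure is a set difference. A minimal sketch, assuming file references are plain path strings collected into sets; the function name `files_to_delete` and its signature are hypothetical and are not taken from the actual cleanup code:

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns every file referenced by an epoch being
/// cleaned (step 2) that the retained "new min" epoch (step 1) does not
/// reference. These are the files that are safe to delete (step 3).
fn files_to_delete(
    new_min_files: &HashSet<String>,
    cleaned_epoch_files: &[HashSet<String>],
) -> HashSet<String> {
    cleaned_epoch_files
        .iter()
        .flatten() // every file referenced by any epoch being cleaned
        .filter(|f| !new_min_files.contains(*f)) // keep only unreferenced ones
        .cloned()
        .collect()
}
```

A file that is still referenced by the new min epoch is kept even if older epochs also reference it, which is why the new min metadata has to be read first.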

To actually determine the files that are referenced, we have to look at the table metadata to figure out the table type and config. That involves looping through all of the tables referenced in a particular operator checkpoint. However, there was a subtle bug: we were using the metadata from (1) for each iteration of (2). That meant that if a table existed in the new_min epoch but not in one of the earlier checkpoints, we would fail to find it in the older checkpoint and panic.

The fix is to ensure we are always iterating over the tables of the epoch that we're cleaning.
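The difference between the two loops can be sketched with a simplified model. Everything here is an assumption made for illustration: `OperatorMetadata` is a hypothetical stand-in that maps a table name to the files it references, and the buggy variant returns an `Err` where the real code panicked.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for per-operator checkpoint metadata:
/// table name -> files that table references in one epoch.
type OperatorMetadata = HashMap<String, Vec<String>>;

/// Buggy shape: iterates over the tables of the new min epoch while
/// reading files from the epoch being cleaned. A table that exists only
/// in the new min epoch cannot be found in the older epoch.
fn referenced_files_buggy(
    new_min: &OperatorMetadata,
    cleaning: &OperatorMetadata,
) -> Result<HashSet<String>, String> {
    let mut files = HashSet::new();
    for table in new_min.keys() {
        let table_files = cleaning
            .get(table)
            .ok_or_else(|| format!("table {} not found in epoch being cleaned", table))?;
        files.extend(table_files.iter().cloned());
    }
    Ok(files)
}

/// Fixed shape: iterates over the tables of the epoch being cleaned,
/// so every lookup is against metadata that is known to contain the table.
fn referenced_files_fixed(cleaning: &OperatorMetadata) -> HashSet<String> {
    cleaning.values().flatten().cloned().collect()
}
```

With a new min epoch that contains a table the older epoch lacks, the buggy variant fails on the missing table while the fixed variant returns the older epoch's files.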

Closes #688

Fix checkpoint cleanup failure (#688)

337888f

mwylde enabled auto-merge (squash) July 17, 2024 22:43

mwylde merged commit 3f2e94b into master Jul 17, 2024
6 checks passed

mwylde added a commit that referenced this pull request Jul 30, 2024

Fix checkpoint cleanup failure (#688) (#689)

62ff145
