Deleted Object Retention - Committed #1932
User story
Users should be able to specify how long objects remain available after they are deleted. The user will be able to define a default period, and a period per branch.
Requirements
Examples
Simple example: cleanup on the main branch
Consider the following history on the main branch:

Suppose the retention period is 7 days, and we are running the cleanup job now. We note that 7 days ago, the branch's HEAD pointed to commit B. Therefore, after the cleanup job:
example1 still exists in the storage, since it is accessible from commit B.
example2 still exists in the storage, since it is accessible from commit B.
example3 does not exist in the storage, since it is not accessible from commit B or any of its descendants.
Complex example: cleanup on multiple branches
Consider the following example, where a feature1 branch is created from the main branch.
On each of the branches we create a new file, and delete it in a subsequent commit. We also delete the file example1 in the main branch. Suppose the retention period is 7 days for the main branch, and 3 days for the feature branch.
Note:
After the cleanup job:
example2 still exists in the storage, since it is accessible from commit B.
example3 does not exist in the storage, since it is not accessible from B, D, or any of their descendants.
example1 still exists in the storage, since it is accessible from commit D. Note that it is maintained even though it was deleted before the retention period of both branches.
Proposal
Configuring the retention period.
The user will have the option to define a default retention period (in days) and a separate period for each desired branch.
For example:
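One possible shape for such a configuration is sketched below in Python. The class and field names (`RetentionConfig`, `default_retention_days`, `branch_retention_days`) are hypothetical and chosen only for illustration; they are not a confirmed lakeFS format. The values mirror the complex example above: 7 days by default and 3 days for the feature branch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RetentionConfig:
    # Hypothetical field names, used only for illustration.
    default_retention_days: int
    branch_retention_days: Dict[str, int] = field(default_factory=dict)

    def days_for(self, branch: str) -> int:
        """Return the branch-specific period, falling back to the default."""
        return self.branch_retention_days.get(branch, self.default_retention_days)


# 7 days by default (applies to main), 3 days for the feature branch.
config = RetentionConfig(
    default_retention_days=7,
    branch_retention_days={"feature1": 3},
)
main_days = config.days_for("main")
feature_days = config.days_for("feature1")
```

A branch without its own entry simply inherits the default, so only the exceptions need to be listed.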
The cleanup job
Definitions
Branch main ancestry: the set of commits generated by recursively getting a commit’s first parent, starting with the branch’s HEAD.
Active commits: commits that were made within the retention period of a branch and are reachable in the branch’s main ancestry.
Expired commits: all commits that are not active.
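The definitions above can be sketched in Python as follows. This is an interpretation rather than the project's implementation: it assumes a commit counts as active if it was the branch HEAD at any point within the retention period, which matches the simple example (commit B, the HEAD 7 days ago, keeps its objects). The `Commit` type, the commit IDs, and the dates are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Commit:
    id: str
    created: datetime
    parents: Tuple[str, ...] = ()  # parents[0] is the first parent


def main_ancestry(commits: Dict[str, Commit], head: str) -> List[Commit]:
    """Follow first parents from the branch HEAD back to the root."""
    chain: List[Commit] = []
    current: Optional[str] = head
    while current is not None:
        commit = commits[current]
        chain.append(commit)
        current = commit.parents[0] if commit.parents else None
    return chain


def active_commits(commits: Dict[str, Commit], head: str,
                   now: datetime, retention_days: int) -> Set[str]:
    """IDs of commits that were the HEAD at some point in the retention period."""
    threshold = now - timedelta(days=retention_days)
    active: Set[str] = set()
    for commit in main_ancestry(commits, head):
        active.add(commit.id)  # the HEAD itself is always active
        if commit.created <= threshold:
            # This commit was the HEAD at the threshold; older ones are expired.
            break
    return active


now = datetime(2021, 5, 10)  # made-up "current time"
commits = {
    "A": Commit("A", now - timedelta(days=10)),
    "B": Commit("B", now - timedelta(days=8), ("A",)),
    "C": Commit("C", now - timedelta(days=5), ("B",)),
    "D": Commit("D", now - timedelta(days=2), ("C",)),
}
active = active_commits(commits, "D", now, retention_days=7)
expired = set(commits) - active
```

With a 7-day retention period, B was the HEAD at the threshold, so B, C, and D are active and only A is expired. With several branches, a commit would be treated as expired only if no branch marks it active.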
Algorithm
The user can run a Spark job that is in charge of removing all outdated data. Here is the outline for this job:
Dangling Commits
Definition: a dangling commit is a commit not accessible from the main ancestry of any branch. That is, it cannot be reached from any branch's HEAD by recursively taking a commit's first parent.
Observation: The algorithm described above will never scan dangling commits.
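To make the definition concrete, here is a minimal Python sketch that finds dangling commits by subtracting every branch's main ancestry from the set of all commits. The input shape (a map from commit ID to first-parent ID) and the commit and branch names are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def main_ancestry_ids(first_parent: Dict[str, Optional[str]], head: str) -> Set[str]:
    """Collect the IDs reached by repeatedly taking the first parent from head."""
    seen: Set[str] = set()
    current: Optional[str] = head
    while current is not None and current not in seen:
        seen.add(current)
        current = first_parent[current]
    return seen


def dangling_commits(first_parent: Dict[str, Optional[str]],
                     branch_heads: Dict[str, str]) -> Set[str]:
    """Commits that are not in the main ancestry of any branch."""
    reachable: Set[str] = set()
    for head in branch_heads.values():
        reachable |= main_ancestry_ids(first_parent, head)
    return set(first_parent) - reachable


# Hypothetical history: C and D were committed on a branch that was later deleted.
first_parent = {"A": None, "B": "A", "E": "B", "C": "B", "D": "C"}
branch_heads = {"main": "E"}
dangling = dangling_commits(first_parent, branch_heads)
```

Because a traversal that starts only from branch HEADs never visits C or D, they need separate handling.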
Bad suggestion: mark dangling commits as hypothetical HEADs for the sake of the algorithm.
If we do that, those commits will never become expired, since our algorithm always marks HEADs as active.
Suggestion: To avoid complicating the algorithm, add a dummy child commit with the same creation date for every dangling commit. Mark this dummy commit as a hypothetical HEAD for the algorithm input. Use the repository's default retention threshold for this commit and its ancestry.
Example: Dangling Commits
Let's take the previous example, after the feature branch was deleted and left its HEAD dangling.
The suggested algorithm will add a new dummy child to the commit, and mark it as a hypothetical HEAD.
If the default retention period for the repository is 7 days, both C and D will be marked as active, because 7 days ago the hypothetical HEAD would have pointed at C. However, if the default retention period is 3 days, both C and D will be marked expired, because 3 days ago the hypothetical HEAD would have pointed at the dummy commit.
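The two outcomes can be reproduced with a short Python sketch. It assumes that a commit is active if it was the (hypothetical) HEAD at some point within the retention period, and it uses made-up dates: C created 8 days ago and D created 5 days ago. The helper names and the `dummy-of-` ID prefix are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Set, Tuple

# commit id -> (creation time, first parent id or None)
History = Dict[str, Tuple[datetime, Optional[str]]]


def add_dummy_head(history: History, dangling_id: str) -> str:
    """Add a dummy child with the dangling commit's creation date; return its id."""
    created, _ = history[dangling_id]
    dummy_id = f"dummy-of-{dangling_id}"
    history[dummy_id] = (created, dangling_id)
    return dummy_id


def active_commits(history: History, head: str,
                   now: datetime, retention_days: int) -> Set[str]:
    """IDs of commits that were the HEAD at some point in the retention period."""
    threshold = now - timedelta(days=retention_days)
    active: Set[str] = set()
    current: Optional[str] = head
    while current is not None:
        created, parent = history[current]
        active.add(current)  # the HEAD itself is always active
        if created <= threshold:
            break  # this commit was the HEAD at the threshold
        current = parent
    return active


now = datetime(2021, 5, 10)  # made-up "current time"
history: History = {
    "B": (now - timedelta(days=10), None),
    "C": (now - timedelta(days=8), "B"),
    "D": (now - timedelta(days=5), "C"),
}
dummy = add_dummy_head(history, "D")
active_7 = active_commits(history, dummy, now, retention_days=7)
active_3 = active_commits(history, dummy, now, retention_days=3)
```

With a 7-day default, `active_7` contains C and D along with the dummy commit. With a 3-day default, `active_3` contains only the dummy commit, so C and D are expired.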