Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deleted Object Retention - Committed #1932

Closed
johnnyaug opened this issue May 10, 2021 · 2 comments
Closed

Deleted Object Retention - Committed #1932

johnnyaug opened this issue May 10, 2021 · 2 comments
Assignees
Labels
new-feature Issues that introduce new feature or capability proposal roadmap/hard-delete https://docs.lakefs.io/understand/roadmap.html#operations
Milestone

Comments

@johnnyaug
Copy link
Contributor

johnnyaug commented May 10, 2021

User story

Users should be able specify the period of time objects will be available after deleted. The user will be able to define a default period, and a period per branch.

Requirements

  • A cleanup job can be manually/periodically triggered.
  • After the cleanup job has run:
    • An object will exist in the storage iff it existed in the HEAD of some branch within the retention period of that branch.
    • In particular, data existing only on branches that were deleted prior to the process should be deleted.
  • Objects outside of storage namespace (e.g ingested data) should not be deleted.
  • Staging data (of all branches) should not be deleted.

Examples

Simple example: cleanup on the main branch

Consider the following history on the main branch:
image

Suppose the retention period is 7 days, and we are running the cleanup job now. We note that 7 days ago, the branch's HEAD pointed to commit B. Therefore, after the cleanup job:

  1. Object example1 still exists in the storage, since it is accessible from commit B.
  2. Object example2 still exists in the storage, since it is accessible from commit B.
  3. Object example3 does not exist in the storage, since it is not accessible from commit B, or any of its descendants.

Complex example: cleanup on multiple branches

Consider the following example, where a feature1 branch is created from the main branch.
image

On each of the branches we create a new file, and delete it in a subsequent commit. We also delete the file example1 in the main branch. Suppose the retention period is 7 days for the main branch, and 3 days for the feature branch.

Note:

  • 7 days ago, the HEAD for the main branch was commit B.
  • 3 days ago, the HEAD for the feature branch was commit D.

After the cleanup job:

  1. Object example2 still exists in the storage, since it is accessible from commit B.
  2. Object example3 does not exist in the storage, since it is not accessible from B, D, or any of their descendants.
  3. Object example1 still exists in the storage, since it is accessible from commit D. Note that it is maintained even though it was deleted before the retention period of both branches.

Proposal

Configuring the retention period.

The user will have the option to define a default retention time (in days) and one for each desirable branch.
For example:

lakectl gc set-default 4 
lakectl gc set main 20
lakectl gc set other-branch 2

The cleanup job

Definitions

Branch main ancestry: the set of commits generated by recursively getting a commit’s first parent, starting with the branch’s HEAD.
Active commits: commits performed within the retention period of a branch, and are reachable in the branch’s main ancestry.
Expired commits: all commits that are not active.

Algorithm

The user can run a spark job that is in charge of removing all outdated data. Here is the outline for this job:

  1. Find the set of all Active commits.
  2. Add the first parent of every Active commit to the set.
  3. Find all Expired commits - this could be optimized by not using all expired commits.
  4. Find the set A of ranges existing in active commits.
  5. Find the set B of ranges existing in expired commits and not in active commits.
  6. Find the set O of objects that are present in ranges from B but not in ranges from A.
  7. Delete objects in set O.

Dangling Commits

Definition: a dangling commit is a commit not accessible from the main ancestry of any branch. That is, it cannot be reached from any branch's HEAD by recursively taking a commit's first parent.

Observation: The algorithm described above will never scan dangling commits.

Bad suggestion: mark dangling commits as hypothetical HEADs for the sake of the algorithm.
If we do that, those commits will never become expired, since our algorithm always marks HEADs as active.

Suggestion: To avoid complicating the algorithm - for every dangling commit, add a dummy child commit with the same creation date. Mark this dummy commit as a hypothetical HEAD for the algorithm input. Use the repository's default retention threshold for this commit and its ancestry.

Example: Dangling Commits

Let's take the previous example, after the feature branch was deleted and left its HEAD dangling.
The suggested algorithm will add a new dummy child to the commit, and mark it as a hypothetical HEAD.

image

If the default retention period for the repository is 7 days, both C and D will be marked as active, because 7 days ago the hypothetical HEAD would have pointed at C. However, if the default retention period is 3 days, both C and D will be marked expired, because 3 days ago the hypothetical HEAD would have pointed at the dummy commit.

@johnnyaug johnnyaug added new-feature Issues that introduce new feature or capability proposal labels May 10, 2021
@arielshaqed
Copy link
Contributor

(this is an issue, so hard to review):

  1. What about deleted branches? I take it that all commits that are no longer on a branch are no longer eligible to be kept?
  2. What about merges? I would guess that tracing active commits goes only through the first parent, but I can understand why you might want to go through all parents.

Some comments about the CLI:

  1. Can we have lakectl config gc or similar? Config options at top level are confusing.
  2. All time durations should have a unit. There's no way of knowing what "3" means in the example.
  3. Can we have config-by-prefix or config-by-glob (and then set-default is just shorthand for set --prefix '*', which is kinda-cool)? That way I can say git config gc set --prefix 'user-*' 3d, and all the user branches will expire after 3 days.
  4. I don't think that we should call it "garbage collection", because that term has a different meaning which confuses me. We might also take the opportunity to prefer a term that says what it does (keep stuff around) rather than how it does it (delete dead stuff). Alternatives include:
    a. MS Exchange calls it "retain deleted objects".
    b. It could be "history" or "kept-history".
    c. Or "keep data", etc.

@guy-har guy-har added this to the Retention milestone May 11, 2021
@johnnyaug
Copy link
Contributor Author

johnnyaug commented May 31, 2021

Race Conditions

In cases where deleted files are brought back to life while the GC job is running, they may or may not be deleted. Such actions include:

  • Revert a commit where a file was deleted.
  • Create a branch from an old commit (which was previously not the HEAD of any branch).
  • Changing the retention period of a branch.
  • Creating a branch from an existing branch, but with a shorter retention period.

New files created during the retention job are not included in any of the sets and therefore are not endangered.

@talSofer talSofer added the roadmap/hard-delete https://docs.lakefs.io/understand/roadmap.html#operations label Jun 13, 2021
@guy-har guy-har closed this as completed Jul 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-feature Issues that introduce new feature or capability proposal roadmap/hard-delete https://docs.lakefs.io/understand/roadmap.html#operations
Projects
None yet
Development

No branches or pull requests

4 participants