Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manifest delete is leaking indexes when deleting client-managed referrer #1228

Closed
1 task done
jlbutler opened this issue Jan 3, 2024 · 19 comments
Closed
1 task done
Labels
bug Something isn't working stale Inactive issues or pull requests triage New issues or PRs to be acknowledged by maintainers

Comments

@jlbutler
Copy link

jlbutler commented Jan 3, 2024

What happened in your environment?

oras manifest delete failed to delete a referrer in an ECR repo. ECR will protect against deletion of content listed in an index, but the operation removes the reference, then deletes.

It turns out that when the index is updated, a new one is pushed but the old one remains. ECR will not delete the manifest as it is still listed in the old index.

What did you expect to happen?

Manifest deleted, updated index in place with referrer removed

How can we reproduce it?

// current client-managed referrers index
$ curl -sX GET -H "Authorization: Basic $TOKEN" https://510431938379.dkr.ecr.us-west-2.amazonaws.com/v2/test-delete/manifests/sha256:29cd94faa1e63cf0052e95e39b222fb3e4bea8de4025b92817c1e447f51737f4 | jq . | grep 78ba
"digest": "sha256:78bab7e405d0d2bac6b410e8873e67a004780e57e71006be8f5c85700cbc080a",

// delete fails
$ oras manifest delete --force --distribution-spec v1.1-referrers-tag 510431938379.dkr.ecr.us-west-2.amazonaws.com/test-delete@sha256:78bab7e405d0d2bac6b410e8873e67a004780e57e71006be8f5c85700cbc080a
Error: failed to delete 510431938379.dkr.ecr.us-west-2.amazonaws.com/test-delete@sha256:78bab7e405d0d2bac6b410e8873e67a004780e57e71006be8f5c85700cbc080a: DELETE "https://510431938379.dkr.ecr.us-west-2.amazonaws.com/v2/test-delete/manifests/sha256:78bab7e405d0d2bac6b410e8873e67a004780e57e71006be8f5c85700cbc080a": response status code 500: Internal Server Error

// new index has been updated
$ curl -sX GET -H "Authorization: Basic $TOKEN" https://510431938379.dkr.ecr.us-west-2.amazonaws.com/v2/test-delete/manifests/sha256:d49c218d6a49444fbba31546481402bd2d18db14072b304463d210515e3e2df8 | jq . | grep 78ba

// old index remains
$ curl -sX GET -H "Authorization: Basic $TOKEN" https://510431938379.dkr.ecr.us-west-2.amazonaws.com/v2/test-delete/manifests/sha256:29cd94faa1e63cf0052e95e39b222fb3e4bea8de4025b92817c1e447f51737f4 | jq . | grep 78ba
"digest": "sha256:78bab7e405d0d2bac6b410e8873e67a004780e57e71006be8f5c85700cbc080a",

What is the version of your ORAS CLI?

1.1.0

What is your OS environment?

macOS / M2

Are you willing to submit PRs to fix it?

  • Yes, I am willing to fix it.
@jlbutler jlbutler added bug Something isn't working triage New issues or PRs to be acknowledged by maintainers labels Jan 3, 2024
@qweeah
Copy link
Contributor

qweeah commented Jan 4, 2024

ORAS will not clean up old referrers index after attaching using referrers tag schema.

If I understand correctly(see below), both the old index and old referrers are garbage. To do GC from client side, you need to delete both 29cd94faa1e and 78bab7e405d

flowchart LR
    A("old referrer index\n29cd94faa1e") --> C("old referrer\n78bab7e405d")
Loading

ECR will not delete the manifest as it is still listed in the old index.

With that implementation 29cd94faa1e need to be deleted first.

@qweeah
Copy link
Contributor

qweeah commented Jan 4, 2024

BTW ORAS CLI used to support referrers index cleanup in v1.1.0-rc.1 but it's too error prone and got reverted in the v1.1.0 release.

@sajayantony
Copy link
Contributor

@toddysm didn’t we decide to not clean up the old manifest because docker hub didn’t support manifest delete? Should we have flag and make delete the default given that DH is now turning out to be a special case?

@jlbutler
Copy link
Author

jlbutler commented Jan 4, 2024

If I understand correctly(see below), both the old index and old referrers are garbage.

I think in the context of gc, these are two different things. The old referrer is the thing the user deleted, so not only garbage but it should be deleted. Whereas, the index is managed by the client, and there's only an "old" one because we can't edit it in place. So the new index clobbers the old in that we move the sha256- tag, and the old one is just left in place. It should be deleted, I think.

Should we have flag and make delete the default given that DH is now turning out to be a special case?

I think that makes sense. I couldn't think of a use case where we'd want to retain the old index. If the only reason is that DH doesn't allow deletion, that seems more like a workaround to me. So, agree with you @sajayantony seems like GC by default would be good, with an optional flag to workaround it as needed.

@toddysm
Copy link

toddysm commented Jan 4, 2024

@sajayantony my recollection of the discussion was to have the flag to force the deletion, but the default behavior is to not delete (i.e. user needs to explicitly ask for deletion). This is inconvenient and inconsistent because you normally don't know if the registry supports deletion until you hit the error but this is the most we can do to support all registries.

The other option we discussed is to degrade the deletion error to warning to ensure automation passes (else automated tasks break for registries that don't support deletion).

Hope this helps.

@qweeah
Copy link
Contributor

qweeah commented Jan 5, 2024

@sajayantony @toddysm Here is a chronological view of supporting GC old referrers index

  1. ORAS v1.0.0 is the last stable version that supports auto GC of old referrers index
  2. Auto cleanup returns false negative for registries that doesn't support deletion, so Improve the error msg for OCI v1.0 registries or registries that don't support deletion API #915 is created.
  3. allow oras command to skip referrer index clean up #954 and add an option to allow skip dangling referrers index deletion error for oras.Copy oras-go#510 are created and merged to support using a new flag for GC.
  4. The GC flag is included in v1.1.0-rc.1
  5. The GC flag is reverted and marked as deprecation in v1.1.0-rc.2

@jlbutler
Copy link
Author

jlbutler commented Jan 5, 2024

Thanks for the conversation, all!

I certainly understand the goal to not have an action like "manifest push" or "attach" result in a weird error about deletion failing. But it seems like we traded off working around the error on some registries by keeping the stale copies always.

but the default behavior is to not delete (i.e. user needs to explicitly ask for deletion)

How does a user do that? That would work (but I still think the default should be the other way around). But, there is no option that I can see to ask for this deletion?

I have to say, the use of the term garbage collection confuses me here, and I wonder if some of the conversation isn't in context of referrers, vs the stale index copy. These stale indexes are a side effect of how the client manages updates to the referrers index. It doesn't seem like garbage collection, but more a step in that operation that we aren't doing.

To further complicate things on ECR, I cannot even delete the referrer I am trying to delete, because ECR protects against deletion of manifests which are referred to in an existing index. Since the old stale index is still in the repo, I have to manually delete it.

I might still be misunderstanding the past conversations here, but it seems the default behavior should be to delete the stale and now useless copy of the index, with an option to retain in on those registries that don't permit me to delete it (like Docker Hub, which implements delete through the DH API vs the OCI API). It could also certainly work to have an option to force delete (though again, that seems like the expected behavior and should probably be default?).

@qweeah
Copy link
Contributor

qweeah commented Jan 5, 2024

To further complicate things on ECR, I cannot even delete the referrer I am trying to delete, because ECR protects against deletion of manifests which are referred to in an existing index. Since the old stale index is still in the repo, I have to manually delete it.

Managing referrers index is just a client-side workaround when Referrers API is not available. As a workaround, it's not perfect and there are several known issues of the workaround, e.g. there is no way to guarantee the data consistency when a referrers index is updated concurrently.

To provide production-level registry service, I believe eventually ECR will implement Referrers API. If ECR uses referrers index as a temporary solution of Referrers API, ORAS CLI can bring back the --skip-delete-referrers flag so the obsolete index can be deleted before deleting the referrer artifact.

@jlbutler
Copy link
Author

jlbutler commented Jan 5, 2024

Managing referrers index is just a client-side workaround

Yes, I fully understand and am really looking forward to a spec release for /referrers. And we knew as we initially discussed it (so long ago, now) that it would be imperfect. That's ok, as long as clients do the best they can, it'll be workable.

I believe eventually ECR will implement Referrers API

For posterity, the implementation is done (and has been for a long while) and we maintain it against the upstream changes. We just need a spec to GA so we can release it.

All this said, there will no doubt be 1.0-only registries going into the future, as some registries may never adopt the proposed 1.1 features. Unless the project intends to deprecate the client-managed options proposed in the Distribution spec 1.1 RC, I think we should consider this as a permanent option.

ORAS CLI can bring back the --skip-delete-referrers flag

Ok, I'll look into a PR for this for your review. I think the name is a bit confusing, what do you think about --preserve-referrers-index?

@toddysm
Copy link

toddysm commented Jan 5, 2024

Just curious, why did we decide to deprecate the flag in RC2? I may have missed that and justification for it.

@shizhMSFT
Copy link
Contributor

@jlbutler @qweeah @sajayantony @toddysm Thanks for the conversation above related to garbage collecting the outdated referrers index. To put us on the same page, I'd like to summarize the issue statement, its cause, and related history.

Issue Analysis

oras CLI leverages the distribution-spec v1.1.0-rc3-compliant oras-go for managing the referrers tag schema when referrers API is not available. Precisely, oras-go does the following steps on a referrer manifest deletion:

  1. Ping the referrers API if the availability of the referrers API is unknown.
    • oras CLI always pings the referrers API as it has no context on previous commands.
  2. If referrers API is available, skip to step 7. Otherwise, continue as referrers tag schema specifies.
  3. Pull the referrers index.
  4. Remove the descriptor entry from the array of manifests that references the manifest to be deleted.
  5. If SkipReferrersGC is toggled, push the updated referrers index, and skip to step 7.
  6. Otherwise, push the updated referrers index if the resulted array of manifest is not empty. Then delete the outdated referrers index.
  7. Delete the specified referrer manifest.

According to @jlbutler, ECR does not support the referrers API and falls back to the referrers tag schema. Since ECR ensures graph consistency, the last step (step 7) fails as the target referrer manifest is still referenced by the outdated referrers index. As @qweeah mentioned, bring back the flag of setting SkipReferrersGC to false will resolve this issue.

Related Discussions

The distribution-spec specifies that clients SHOULD push the updated referrers list (i.e. referrers index) when deleting manifests. However, the spec does not specify how clients should deal with the outdated referrers index. Therefore, whether to delete the outdate referrers index relies on the client implementations.

With the same rationale as the #1228 (comment) by @jlbutler, oras v1.0.x deletes the outdated referrers index on update (add or delete entries). Sooner, we found this strategy does not work on append-only registries like DockerHub (#915). Thus, the flag --skip-delete-referrers was proposed (#954), implemented (#957), and released in oras v1.1.0-rc.1.

While the flag --skip-delete-referrers resolves the immediate technical problem, it causes usability issues such as the non-zero exit code issue. To streamline the oras CLI and resolve usability issues, @sajayantony proposed removing the flag --skip-delete-referrers (#1041) with registry behavior aligned (details in #1041 should answer @toddysm's question). The proposal is implemented by #1060 and #1059, and released in oras v1.1.0-rc.2.

@jlbutler
Copy link
Author

jlbutler commented Jan 9, 2024

Thanks for this detailed summary, @shizhMSFT. It's great for this conversation as well as for posterity's sake.

One remaining question regarding a new option and its default value. In 1.0 without options, the stale index was deleted by default and there was no option to skip that. In 1.1 the default is to not delete the stale index, and there is no option to delete it.

What are our thoughts at this point around backward compatibility? Is it ok to once again delete it as with 1.0, and offer an option to preserve? Or should we not delete it as with 1.1, and offer an option to delete?

Once we are agreed on this, I'll follow up with a PR.

@shizhMSFT
Copy link
Contributor

Based on the intention of #1041 (especially the alignment with registry behavior), I prefer not changing the default behavior. In other words, I prefer

Or should we not delete it as with 1.1, and offer an option to delete?

@jlbutler
Copy link
Author

jlbutler commented Jan 9, 2024

The question in my opinion is really just around backward compatibility of the Oras client, rather than alignment with registry behavior. As we have discussed in this thread, registry behavior between implementations is inconsistent. These differences are OCI compliant, in that registries MAY implement deletion.

To bring the context back to the history of client-managed index behavior, from above:

  1. The GC flag is included in v1.1.0-rc.1
  2. The GC flag is reverted and marked as deprecation in v1.1.0-rc.2

To me, 4) seems like a workaround for an integration bug found with 1.0 for registries that don't support deletion, while 5) changes default behavior to work around that bug permanently, but introduces other issues by dropping the option altogether. 1.0 traded an integration issue for a different one in 1.1. It also introduced a breaking change, not by dropping an experimental option, but by changing default behavior. This 1.1 default behavior also has the side effect of creating stale content in the repository.

Given all this, I lean toward wanting to revert to 1.0 default behavior, with an optional skip for delete (as in 1.1.0-rc1). This would restore default GA behavior, but also does not retain stale content created by the client by default. If delete is not enabled in the remote registry, then an option to skip deletion works around that. Never deleting by default means stale content builds up over time, as a side effect of an update operation implemented by the client.

As I said, I'll support maintainers' preference on this, I just want to make sure we're looking at the whole issue.

And I know I missed prior conversations, but I think this statement from #1041 was fairly hopeful:

However, with the increasing adoption of the referrers API among registries...

I think the project should assume that even beyond 1.1 GA it's likely that some registries won't ever support its features. It seems in a lot of threads the client-managed index is seen as temporary, but it might be worth considering it more of a legacy support option.

Sorry this got long, just want to make sure I'm making my point clearly. I'll start going in the direction of default retain stale index, but we can hopefully discuss further.

@FeynmanZhou
Copy link
Member

@jlbutler If we agree that this is an valid enhancement request, would you be able to own it? Do you want to bring this fix into the upcoming release (ORAS v1.2.0) by Feb 6?

@jlbutler
Copy link
Author

jlbutler commented Feb 6, 2024

@jlbutler If we agree that this is an valid enhancement request, would you be able to own it? Do you want to bring this fix into the upcoming release (ORAS v1.2.0) by Feb 6?

Obviously I didn't get this handled by Feb 6. I'm still willing to do it, just got pulled away on some other things. Can get back to it in the next week or so.

Copy link

github-actions bot commented Apr 7, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale Inactive issues or pull requests label Apr 7, 2024
@qweeah qweeah removed the stale Inactive issues or pull requests label Apr 7, 2024
Copy link

github-actions bot commented Jun 7, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale Inactive issues or pull requests label Jun 7, 2024
Copy link

github-actions bot commented Jul 7, 2024

This issue was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale Inactive issues or pull requests triage New issues or PRs to be acknowledged by maintainers
Projects
None yet
Development

No branches or pull requests

6 participants