-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graceful secret removal #1715
Graceful secret removal #1715
Conversation
b6b11f9
to
aa02d07
Compare
Current coverage is 55.76% (diff: 75.00%)@@ master #1715 diff @@
==========================================
Files 96 96
Lines 15005 15020 +15
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 8390 8376 -14
- Misses 5504 5522 +18
- Partials 1111 1122 +11
|
@@ -31,6 +31,34 @@ func secretFromSecretSpec(spec *api.SecretSpec) *api.Secret { | |||
} | |||
} | |||
|
|||
// checkSecretInUse does a best effort to find if the passed in secret is in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"a best effort check"
LGTM |
LGTM I'm not 100% convinced by the |
err error | ||
) | ||
|
||
s.store.View(func(tx store.ReadTx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is racy: by the time it returns true, the result may not actually be true and vice versa. The transaction needs to be held for the duration of the action take based upon the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how all of our checks work, and why there is a description that this is "best-effort"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope not. This isn't really best-effort. The actual result is undefined. This is particularly pathological under deletes, as we could be deleting a secret that is actually referenced.
http://stackoverflow.com/a/34550/253486 explains this in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the stackoverflow link, but I understand both the concept of a race-condition and the implications.
The current code does delete secrets that are actually referenced, so by adding this code, we're doing a best-effort attempt at checking conflicts.
@aluzzardi @aaronlehmann talking to @ehazlett seems like he doesn't see the need for a Given this, I'm going to make the I wanted to allow DDC to force remove a secret, which was a feature that was asked by a customer, but I guess they can use annotations to simulate the feature? /cc @jlhawn |
Sorry, I don't know what exactly Sure it would be nice if
I'm 👍 on having the error message contain this list of conflicts. Given that, one could write a simple script which loops over the conflicting services and updates each to remove the secret reference and try to delete the secret again. This does leave a window where another service could be created/updated to use the secret, but if this happens unexpectedly you'd just get another conflict message when you retry, and you should probably discuss it with whoever else is operating on the same resources in the cluster at the same time as you. |
@jlhawn should have been more explicit, my bad. Without force: DELETE operation errors with a list of services that are currently using the secret |
What side effects does this have? If a service task needs to be rescheduled (before the service spec is updated to remove the secret reference) will it fail to be provided that secret value or does it somehow use the last known value? |
I think Dangling references will lead to gradual swarm degradation, which is not a great way to die. If we do need force, I'd rather see this implemented either in the CLI or in UCP. The implementation would update all referenced services and just remove the reference, then trigger a secret removal. |
This could be race-free only if we had to reverse lookup services by secret ID. @aaronlehmann I believe that's impossible with the current store, right? |
Yes. |
@aluzzardi a client-side remove that updates (potentially dozens) of services simultaneously seems like something we shouldn't implement since it will cause the forceful rolling update of all of them. |
@diogomonica That seems like a walk in the park compared to dangling secret references IMO :) |
074fdaf
to
044a71f
Compare
services = append(services, k) | ||
} | ||
|
||
return grpc.Errorf(codes.DataLoss, "secret '%s' is in use by the following service: %v", secretName, strings.Join(services, ", ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronlehmann I'm returning a codes.DataLoss
here. I thought it was pretty adequate, but I'm fine with changing it. Just FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think DataLoss
is correct.
DataLoss indicates unrecoverable data loss or corruption.
Also, what I had in mind here was to say "service" if len(services) == 1
and "services" otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds a bit more code just for a little bit of syntax sugar, but changed in both PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt a little silly asking for it, but I know otherwise someone would complain about it after the fact.
e9a78a2
to
2411b08
Compare
|
||
// removing a secret that exists but is in use fails | ||
_, err = ts.Client.RemoveSecret(context.Background(), &api.RemoveSecretRequest{SecretID: resp.Secret.ID}) | ||
assert.Equal(t, codes.DataLoss, grpc.Code(err), grpc.ErrorDesc(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test is still checking for DataLoss
.
2411b08
to
1611080
Compare
serviceStr = "service" | ||
} | ||
|
||
return grpc.Errorf(codes.InvalidArgument, "secret '%s' is in use by the following %s: %v", serviceStr, secretName, strings.Join(services, ", ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the wrong order of arguments.
1611080
to
78ecc87
Compare
Thanks for the update, @diogomonica The pattern you're using (View(check references) ; Update(delete secret)) is the one we've been using all over (e.g. networks) so it makes sense to follow the same one here for consistency and the following comment has nothing to do with this PR, it's more a SwarmKit general question: What is worse, risking inconsistencies or holding down the update lock for too long? There are two possibilities:
The perfect (and long term) solution is making @aaronlehmann You know the implications of holding the store lock for too long, what do you think? |
@aluzzardi I used pattern 2 in #1725, but for delete, iterating through all the services while locked might be a bit too much (chatting with @aaronlehmann he thought so too). I'm fine with either, I don't care either way. |
I guess we can change this to do things consistently. Holding the store lock just blocks all other transactions. Listing all services generally shouldn't take long - in extreme cases maybe tens or hundreds of ms? A leadership election blocks things for a lot longer. People probably won't remove secrets very often. But listing all services or tasks isn't really a good pattern, so we should only do this where absolutely necessary. |
@aluzzardi @aaronlehmann your call. Either merge it or tell me to change it. |
Sounds good - let's favor consistency over performance then. @diogomonica, would you mind changing this to perform the check in @mavenugo @mrjana Whenever you get a chance, could you do the same for networks? (as in, move the conflict check into the strongly consistent |
78ecc87
to
4a45d67
Compare
Signed-off-by: Diogo Monica <diogo.monica@gmail.com>
6259e60
to
fe032ff
Compare
// - Returns an error if the deletion fails. | ||
func (s *Server) RemoveSecret(ctx context.Context, request *api.RemoveSecretRequest) (*api.RemoveSecretResponse, error) { | ||
if request.SecretID == "" { | ||
return nil, grpc.Errorf(codes.InvalidArgument, "secret ID must be provided") | ||
} | ||
|
||
err := s.store.Update(func(tx store.Tx) error { | ||
// Check inf the secret exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
fe032ff
to
6588784
Compare
Signed-off-by: Diogo Monica <diogo.monica@gmail.com>
6588784
to
f67b6d7
Compare
LGTM |
1 similar comment
LGTM |
Added best-effort verification of secrets in use before Removal, unless Force is specified.
/cc @cyli @aaronlehmann