Fix configsync for static -> dynamic -> static #54
Conversation
@@ -91,12 +91,6 @@ func NewRanch(config string, s *Storage, ttl time.Duration) (*Ranch, error) {
		requestMgr: NewRequestManager(ttl),
		now:        time.Now,
	}
	if config != "" {
		if err := newRanch.SyncConfig(config); err != nil {
This races with the cache startup and will fail if the cache wasn't started in time
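For illustration, a minimal sketch of one way to avoid that race, assuming a recent controller-runtime (v0.7-era) where the manager's cache exposes WaitForCacheSync(ctx); the helper name startConfigSync and the configSyncer interface standing in for *ranch.Ranch are illustrative, not boskos code:

package main

import (
	"context"
	"errors"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// configSyncer stands in for the part of *ranch.Ranch used here.
type configSyncer interface {
	SyncConfig(path string) error
}

// startConfigSync waits for the manager's informer cache to sync and only then
// performs the initial config sync, instead of doing it inside NewRanch before
// the cache has been started.
func startConfigSync(ctx context.Context, mgr manager.Manager, r configSyncer, config string) error {
	if !mgr.GetCache().WaitForCacheSync(ctx) {
		return errors.New("cache did not sync before the context was cancelled")
	}
	if config == "" {
		return nil
	}
	if err := r.SyncConfig(config); err != nil {
		return fmt.Errorf("initial config sync failed: %w", err)
	}
	return nil
}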
cmd/boskos/boskos.go
Outdated
	panic(fmt.Sprintf("BUG: expected *crds.ResourceObject, got %T", e.ObjectNew))
}

return resource.Status.State == common.Tombstone
This is needed because boskos is the entity that deletes Tombstoned resources:
Line 291 in eea29b1
if r.Status.State == common.Tombstone {
The process is:
- Boskos sets to ToBeDeleted
- Cleaner cleans and sets to Tombstone
- Boskos deletes
To me it seems that the cleaner should just delete them. The only additional thing we do here is check that they have no owner, but they will always have no owner because the cleaner always cleans first and then releases:
Lines 88 to 90 in eea29b1
c.recycleOne(res)
if err := c.client.ReleaseOne(res.Name, common.Tombstone); err != nil {
	logrus.WithError(err).Errorf("failed to release %s as %s", res.Name, common.Tombstone)
Are there more reasons or can we (in a follow-up) make the cleaner just delete the resources?
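For reference, a rough sketch of what that follow-up could look like, assuming the cleaner were handed a controller-runtime client for the resource CRDs (it currently only talks to boskos through its client, so the helper below is purely hypothetical):

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/boskos/crds"
)

// deleteCleanedResource is a hypothetical helper: instead of releasing a cleaned
// resource as Tombstone and leaving the deletion to boskos, delete the CRD object
// directly once cleaning has succeeded.
func deleteCleanedResource(ctx context.Context, c ctrlclient.Client, namespace, name string) error {
	obj := &crds.ResourceObject{ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name}}
	return c.Delete(ctx, obj)
}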
This seems reasonable to me. There may have been some reason this was separated that got removed in subsequent refactorings.
Force-pushed from eea29b1 to d01038c (Compare)
Force-pushed from d01038c to 92a6d73 (Compare)
Force-pushed from 92a6d73 to 0125101 (Compare)
Thanks for this! Generally LG, just a few questions/comments.
if err := s.client.Delete(s.ctx, o); err != nil {
	return err
}
if err := wait.Poll(100*time.Millisecond, 5*time.Second, func() (bool, error) {
Why do we need to do this? And where did the value of 5s come from?
The client is cache-backed, which means it takes a moment for our own changes to become visible to us. This is important here, because otherwise we would, a moment later, get the DRLC, not see this change, exclude the now-static resources from the number of resources created for this DRLC, conclude that we have too few, create them, and then a moment later attempt to delete all of them.
The 100ms and 5s are relatively arbitrary; under normal circumstances the cache will lag between 200ms and 2s behind.
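To make the discussion above concrete, here is a minimal sketch of such a wait, assuming a cache-backed controller-runtime (v0.7-era) client; the helper name waitForDeletionVisible is illustrative, and the intervals are the arbitrary values mentioned above:

package main

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeletionVisible polls the cache-backed client until our own Delete is
// reflected in the cache, so the next pass over the config doesn't act on stale data.
func waitForDeletionVisible(ctx context.Context, c ctrlclient.Client, key ctrlclient.ObjectKey, obj ctrlclient.Object) error {
	return wait.Poll(100*time.Millisecond, 5*time.Second, func() (bool, error) {
		err := c.Get(ctx, key, obj)
		if apierrors.IsNotFound(err) {
			// The cache no longer serves the object, so our deletion is visible.
			return true, nil
		}
		// err == nil means the object is still in the cache: keep polling.
		// Any other error aborts the poll.
		return false, err
	})
}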
ranch/storage.go
Outdated
// Mark for deletion of all associated dynamic resources.
existingDRLC.Spec.MinCount = 0
existingDRLC.Spec.MaxCount = 0
dRLCToUpdate = append(dRLCToUpdate, existingDRLC)
if err := s.client.Update(s.ctx, existingDRLC.DeepCopyObject()); err != nil {
Why is it necessary to do this update in the loop?
(Partially I'm wondering why we treat this as a special update, rather than just passing the DRLC to deleteDynamicResourceLifecycles, though I guess we don't have the list of static resources here. Hm.)
I guess it feels like this code is a bit messy and needs some cleaning up, though that does predate this PR. Thoughts?
It needs to be in the loop because this is the only change for which we must make sure it is back in the cache before proceeding.
I guess it feels like this code is a bit messy and needs some cleaning up, though that does predate this PR. Thoughts?
Certainly. For example, the approach here of constructing a slice and then passing it to another func that serially updates all of its elements is needlessly complex and adds a couple of allocations; updating directly here would be simpler. But this change already has an about 400-line diff, so I didn't want to make that worse.
Currently we reconcile the config on file update, or the dynamic part periodically. This doesn't catch all cases; for example, if non-dynamic resources need to be deleted but are currently in use, that will only happen when the config file gets changed. Additionally, it's slow, as it happens only periodically. This change moves that into a simplistic controller that reacts to config file, resource object, and dynamic resource object changes.
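A rough sketch of the shape such a controller can take, assuming a controller-runtime (v0.7-era) manager; the reconciler, the channel wiring for the file watcher, and type names like DRLCObject are illustrative rather than the exact code in this PR:

package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	"sigs.k8s.io/boskos/crds"
)

// configSyncReconciler re-runs the config sync whenever anything relevant changes.
type configSyncReconciler struct {
	sync func() error // e.g. a closure around the ranch's SyncConfig(configPath)
}

func (r *configSyncReconciler) Reconcile(ctx context.Context, _ ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, r.sync()
}

// addConfigSync wires the reconciler up to resource objects, DRLC objects, and a
// channel that a file watcher feeds whenever the config file changes. Events sent
// on the channel must carry a non-nil object for the default handler to enqueue them.
func addConfigSync(mgr ctrl.Manager, configChanged <-chan event.GenericEvent, sync func() error) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&crds.ResourceObject{}).
		Watches(&source.Kind{Type: &crds.DRLCObject{}}, &handler.EnqueueRequestForObject{}).
		Watches(&source.Channel{Source: configChanged}, &handler.EnqueueRequestForObject{}).
		Complete(&configSyncReconciler{sync: sync})
}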
Force-pushed from 0125101 to ee96ae3 (Compare)
ranch/storage.go
Outdated
	errs = append(errs, err)
	continue
}
// Make sure we have this change in our cache
Should we just put this logic in UpdateDynamicResourceLifeCycle, and then reduce some of the code duplication? (I guess figuring out the right condition to check is tricky...)
couldn't we run into similar issues where an update to a DRLC (e.g. changing the mincount/maxcount to something nonzero) results in us creating and then immediately deleting a dynamic resource?
I guess it might increase the latency for syncs, but hopefully not by much?
Done (We have this logic twice, once in syncDynamicResourceLifeCycles and once in UpdateAllDynamicResources. We should probably unexport the latter and remove those bits there, as they are redundant now)
Can you replace the duplicated logic in this function now with a call to UpdateDynamicResourceLifeCycle? I.e. replace the s.client.Update call (after setting MinCount and MaxCount to 0) with UpdateDynamicResourceLifeCycle(&dRLCToDelelete[idx])?
(Also, I just noticed the typo on Delete in dRLCToDelelete.)
Done and fixed the typo
Force-pushed from ee96ae3 to 20c726f (Compare)
Force-pushed from 20c726f to 3834f37 (Compare)
@ixdy any chance you can have another look?
Apologies for the long review latency on this one. LGTM!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, ixdy
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
…e, and GCP by region"" This reverts commit 8a22fc4, openshift#12842. Boskos' config reloading on dynamic -> static pivots has been fixed by kubernetes-sigs/boskos@3834f37d8a (Config sync: Avoid deadlock when static -> dynamic -> static, 2020-12-03, kubernetes-sigs/boskos#54), so we can take another run at static leases for these platforms. Not a clean re-revert, because 4705f26 (core-services/prow/02_config: Drop GCP Boskos leases to 80, 2020-12-02, openshift#14032) landed in the meantime, but it was easy to update from 120 to 80 here.
First commit: Make config sync a controller
Tombstone state
Second commit: Fix config sync when a config is changed from static -> dynamic -> static
This currently causes an IsAlreadyExists error because the calculation for the static resource creation ignores all resources that have a corresponding dynamic config. This blocks any further progress.
-> Fixed by including all resources that have a static config in the static resource add/delete calculation
-> This made the dynamic resource reconciliation instantly delete them. Fixed by making the dynamic resource reconciliation ignore all resources that have a static config
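The gist of those two fixes, as a hedged sketch (the helper names and the map-based config lookup are illustrative; the real logic lives in the ranch/storage code):

package sketch

// staticResourcesToReconcile keeps every existing resource that has a static
// config entry, even if a dynamic config for the same type also exists, so the
// static add/delete calculation accounts for it instead of re-creating it.
func staticResourcesToReconcile(existing []string, hasStaticConfig map[string]bool) []string {
	var out []string
	for _, name := range existing {
		if hasStaticConfig[name] {
			out = append(out, name)
		}
	}
	return out
}

// dynamicResourcesToReconcile drops every resource that has a static config
// entry, so the dynamic reconciliation doesn't immediately delete resources
// that just became static again.
func dynamicResourcesToReconcile(existing []string, hasStaticConfig map[string]bool) []string {
	var out []string
	for _, name := range existing {
		if !hasStaticConfig[name] {
			out = append(out, name)
		}
	}
	return out
}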
Note: While I added a unit test case for this specific problem, I also verified the change by deploying boskos and changing its config (for the same resource) from static -> dynamic -> static -> dynamic -> static, in order to make sure that reconciliation and triggering work correctly and that there are no errors.
/assign @ixdy