Account for two different kinds of consistency issues #283
This commit is intended to address two issues that we diagnosed while investigating crossplane-contrib/provider-aws#802.

The first issue is that controller-runtime does not guarantee reads from cache will return the freshest version of a resource. It's possible we could create an external resource in one reconcile, then shortly after trigger another in which it appears that the managed resource was never created because we didn't record its external-name. This only affects the subset of managed resources with non-deterministic external-names that are assigned during creation.

The second issue is that some external APIs are eventually consistent. A newly created external resource may take some time before our ExternalClient's observe call can confirm it exists. AWS EC2 is an example of one such API.

This commit attempts to address the first issue by making an Update to a managed resource immediately before Create is called. This Update call will be rejected by the API server if the managed resource we read from cache was not the latest version.

It attempts to address the second issue by allowing managed resource controller authors to configure an optional grace period that begins when an external resource is successfully created. During this grace period we'll requeue and keep waiting if Observe determines that the external resource doesn't exist, rather than (re)creating it.

Signed-off-by: Nic Cope <negz@rk0n.org>
Interestingly I did see the above case once in the logs from my 12-13 hours of testing. It seems to have resolved itself (the RouteTable was cleaned up successfully). I'm currently experiencing some login issues so I can't actually confirm whether or not the RouteTable was leaked, but I imagine it wasn't, because if it was it would have blocked deletion of the VPC managed resource it existed in.
LGTM! Thanks for fixing this hard problem with the least code change necessary in providers!
The retry logic we use to persist critical annotations makes it difficult to delete an annotation without potentially also deleting annotations added by another controller (e.g. the composition logic). This commit therefore changes the way we detect whether we might have created an external resource but not recorded the result. Previously we relied on the presence of the 'pending' annotation to detect this state. Now we check whether the 'pending' annotation is newer than any 'succeeded' or 'failed' annotation.

Signed-off-by: Nic Cope <negz@rk0n.org>
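To make the "newer than" check concrete, here is a minimal sketch of the comparison described above. It assumes the three annotations hold RFC 3339 timestamps and uses illustrative annotation keys; `createIncomplete` and `parseTime` are hypothetical helpers written for this example, not the functions added by this commit.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation keys assumed for illustration only.
const (
	keyPending   = "crossplane.io/external-create-pending"
	keySucceeded = "crossplane.io/external-create-succeeded"
	keyFailed    = "crossplane.io/external-create-failed"
)

// parseTime returns the zero time when the annotation is missing or
// is not a valid RFC 3339 timestamp.
func parseTime(annotations map[string]string, key string) time.Time {
	t, err := time.Parse(time.RFC3339, annotations[key])
	if err != nil {
		return time.Time{}
	}
	return t
}

// createIncomplete reports whether a create was marked pending and no
// later 'succeeded' or 'failed' annotation was recorded. No annotation
// ever needs to be deleted for this check to work.
func createIncomplete(annotations map[string]string) bool {
	pending := parseTime(annotations, keyPending)
	if pending.IsZero() {
		return false // A create was never attempted.
	}
	succeeded := parseTime(annotations, keySucceeded)
	failed := parseTime(annotations, keyFailed)
	return pending.After(succeeded) && pending.After(failed)
}

func main() {
	a := map[string]string{
		keyPending:   "2021-09-01T10:05:00Z",
		keySucceeded: "2021-09-01T10:00:00Z",
	}
	// The pending timestamp is newer than the last success, so the
	// outcome of the most recent create attempt is unknown.
	fmt.Println(createIncomplete(a))
}
```

Because only timestamps are compared, a stale 'succeeded' or 'failed' annotation left over from an earlier attempt does not hide a newer 'pending' one.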
Nice, looks good!
/backport
/backport (Now that the commands workflow actually exists in this repo.)
Successfully created backport PR #288 for |
/backport This time for the v0.13 and v0.14 releases.
Backport failed for
Please cherry-pick the changes locally.
git fetch origin release-0.13
git worktree add -d .worktree/backport-283-to-release-0.13 origin/release-0.13
cd .worktree/backport-283-to-release-0.13
git checkout -b backport-283-to-release-0.13
ancref=$(git merge-base 589072e678f8d9b4db515bbc7c7f1e129305d6ce 8e780ecd6d30f0a5e024f047311c3ededa915b8d)
git cherry-pick -x $ancref..8e780ecd6d30f0a5e024f047311c3ededa915b8d
Successfully created backport PR #289 for |
These annotations were introduced in crossplane/crossplane-runtime#283. Per crossplane/crossplane#3037 folks find these annotations hard to reason about. That's understandable, because they're doing a lot of subtle things. This section ended up super long, but I think this is an area where folks really need to understand what's happening in order to make good decisions when Crossplane refuses to proceed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Description of your changes
Fixes crossplane-contrib/provider-aws#802
Closes #279
Closes #280
This PR is intended to address two issues that we diagnosed while investigating crossplane-contrib/provider-aws#802. It addresses feedback from and supersedes #279 and #280. Both issues are known to cause 'leaked' and duplicate managed resources. I have only been able to personally reproduce the second issue, but we have evidence of the first happening in the wild.
The first issue is that controller-runtime does not guarantee reads from cache will return the freshest version of a resource. It's possible we could create an external resource in one reconcile, then shortly after trigger another in which it appears that the managed resource was never created because we didn't record its external-name. This only affects the subset of managed resources with non-deterministic external-names that are assigned during creation.
The second issue is that some external APIs are eventually consistent. A newly created external resource may take some time before our ExternalClient's observe call can confirm it exists. AWS EC2 is an example of one such API.
This PR attempts to address the first issue by making an Update to a managed resource immediately before Create is called. This Update call will be rejected by the API server if the managed resource we read from cache was not the latest version.
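The Update-before-Create guard can be illustrated with a small in-memory model of the API server's optimistic concurrency check. This is a sketch only: `apiServer`, `managed`, `createIfFresh` and the integer resource version are hypothetical stand-ins for this example, not types from crossplane-runtime or controller-runtime.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the object has been modified")

// managed is a hypothetical, stripped-down managed resource.
type managed struct {
	ResourceVersion int
	CreatePending   bool
	ExternalName    string
}

// apiServer models optimistic concurrency: an Update is accepted only
// if the caller's copy carries the stored resource version.
type apiServer struct{ current managed }

func (s *apiServer) Update(m managed) (managed, error) {
	if m.ResourceVersion != s.current.ResourceVersion {
		return managed{}, errConflict
	}
	m.ResourceVersion++
	s.current = m
	return m, nil
}

// createIfFresh writes to the managed resource before calling create.
// If the cached copy is stale the Update is rejected and create is
// never called, so no duplicate external resource is made.
func createIfFresh(s *apiServer, cached managed, create func() string) error {
	cached.CreatePending = true
	updated, err := s.Update(cached)
	if err != nil {
		return err
	}
	updated.ExternalName = create()
	updated.CreatePending = false
	_, err = s.Update(updated)
	return err
}

func main() {
	// The server already recorded an external name, but the cache
	// still returns the older version without it.
	s := &apiServer{current: managed{ResourceVersion: 2, ExternalName: "rtb-123"}}
	stale := managed{ResourceVersion: 1}

	err := createIfFresh(s, stale, func() string { return "rtb-duplicate" })
	fmt.Println(err, s.current.ExternalName)
}
```

In this model the stale reconcile fails with a conflict and is simply retried later, by which time the cache is expected to have caught up.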
It attempts to address the second issue by allowing managed resource controller authors to configure an optional grace period that begins when an external resource is successfully created. During this grace period we'll requeue and keep waiting if Observe determines that the external resource doesn't exist, rather than (re)creating it.
I have:
Run make reviewable to ensure this PR is ready for review.
How has this code been tested
https://gist.github.com/negz/e1f2e74f18802d15440214a1a1abc981
I've run the script from the above gist for about 12 hours against provider-aws built with this PR and observed zero leaked resources.