Improve approval reconciler timings #797
@@ -47,7 +47,7 @@ const (
 DelayAnnotationName = "approval.nephio.org/delay"
 PolicyAnnotationName = "approval.nephio.org/policy"
 InitialPolicyAnnotationValue = "initial"
-RequeueDuration = 10 * time.Second
+RequeueDuration = 15 * time.Second
@efiacor What does it signify? Should this value be statically calculated?
https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.17.2/pkg/reconcile#Result
How do you mean, statically calculated?
How do we know that 15 is the right value?
How would you propose that we statically calculate it?
I don't know how it works, or what factors govern it in the first place.
You changed it from 10 to 15, so I thought you would know why exactly this value was chosen.
I guess the original change was to address the removal of some watch events from Porch.
#452
With this change, the retrieval of the PR readiness gates has been sped up greatly, leading to another race condition that I observed, especially in the 006 test (AMF/SMF deploy).
To address this, I added a delay to the said packagevariants, but still I noticed intermittent issues where the AMF especially would go from approved - v1, to being reprocessed and a v2 branch getting created, thus failing the initial policy.
Bumping the requeue duration seems to allow time for the pkgrev mutation to happen without it being pushed on to a v2 revision.
Overall, the auto approve does need more refinement. Some of the watch events from Porch need to be looked at more closely, and this controller needs to be aligned with them.
This fix is mainly focused on reducing the excessive "wait" timings we are seeing in the CI runs, but also on drawing more attention to this controller, as it may be of more use to users than the free5gc-specific ones here.
"I noticed intermittent issues where the AMF especially would go from approved - v1, to being reprocessed and a v2 branch getting created"
Do you know if it was the approval controller that approved the packagerevision prematurely, or was it some other actor?
The only other actor would be the test script itself or a human. Tweaking timings is a losing battle; the overall flow should be robust to race conditions. So, I think using a different approval policy makes more sense.
Still doing some debugging to determine the cause/source, but it seems the pv controller is triggering a new "clone" when it finds a diff in the Kptfile.
https://github.com/nephio-project/porch/blob/main/controllers/packagevariants/pkg/controllers/packagevariant/packagevariant_controller.go#L827
This comes from the generic specializer controller I believe.
Even with using an "always" policy I am seeing some odd behaviour. Will continue with more refinements later.
These changes look fine to me. However, it is clear that the "initial" policy isn't really meeting our needs. I suggest we add another policy to the approval controller.
See: #398 - I will add another comment there.
Perfect. I will add a new "always" policy in the next PR.
/approve
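To make the policy discussion concrete, here is a minimal sketch of how the two policies might differ. It rests on assumptions: that "initial" means "approve only when no revision of the package has been published yet" (consistent with the v2 case failing the initial policy above), and that the proposed "always" policy approves unconditionally. `shouldApprove` and `publishedRevisions` are hypothetical names, and "always" is only proposed in this thread, so none of this is the controller's actual implementation.

```go
package main

import "fmt"

const (
	// PolicyInitial is the annotation value shown in the diff.
	PolicyInitial = "initial"
	// PolicyAlways is the value proposed in this thread (assumption).
	PolicyAlways = "always"
)

// shouldApprove is a hypothetical decision function.
// publishedRevisions is the number of revisions of the same package
// that have already been published.
func shouldApprove(policy string, publishedRevisions int) bool {
	switch policy {
	case PolicyInitial:
		// Assumed semantics: only the very first revision is auto-approved.
		return publishedRevisions == 0
	case PolicyAlways:
		// Assumed semantics: every proposed revision is auto-approved.
		return true
	default:
		// Unknown or missing policy: leave approval to a human.
		return false
	}
}

func main() {
	for _, policy := range []string{PolicyInitial, PolicyAlways} {
		for _, published := range []int{0, 1} {
			fmt.Printf("policy=%s published=%d approve=%t\n",
				policy, published, shouldApprove(policy, published))
		}
	}
}
```

Under these assumptions, a package that is reprocessed into a v2 after v1 was published is rejected by "initial" but accepted by "always", which is why the latter is less sensitive to the race described in this thread.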
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: johnbelamaric, kushnaidu
Change the approval controller's PackageRevision (PR) Get to hit the API directly instead of reading from the local cache.
Adjust the requeue duration to prevent a race condition.
While debugging the approval delay issue reported here, it became apparent that the packagerev being fetched was a cached version which didn't get updated for quite some time.
To circumvent this, we are retrieving the PR using the apiReader interface, which bypasses the local cache and hits the k8s API directly.
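The difference between a cached read and a direct read can be illustrated with a small standard-library simulation. Everything in it is a stand-in: `Reader`, `apiServer`, and `cachedReader` are hypothetical types, and the lifecycle strings are only sample data. In controller-runtime, an uncached reader is typically obtained from the manager's `GetAPIReader()`; how this PR wires it up is not shown here.

```go
package main

import "fmt"

// Reader is a hypothetical stand-in for a read-only client interface.
type Reader interface {
	Get(name string) (lifecycle string, found bool)
}

// apiServer simulates the authoritative store; reading from it
// directly always returns the latest state.
type apiServer struct {
	objects map[string]string
}

func (s *apiServer) Get(name string) (string, bool) {
	v, ok := s.objects[name]
	return v, ok
}

// cachedReader serves reads from a snapshot that is refreshed only
// when Sync is called, so it can lag behind the server.
type cachedReader struct {
	source   *apiServer
	snapshot map[string]string
}

func newCachedReader(source *apiServer) *cachedReader {
	c := &cachedReader{source: source}
	c.Sync()
	return c
}

// Sync copies the server's current state into the snapshot.
func (c *cachedReader) Sync() {
	c.snapshot = make(map[string]string, len(c.source.objects))
	for k, v := range c.source.objects {
		c.snapshot[k] = v
	}
}

func (c *cachedReader) Get(name string) (string, bool) {
	v, ok := c.snapshot[name]
	return v, ok
}

func main() {
	server := &apiServer{objects: map[string]string{"amf-v1": "Draft"}}
	var cached Reader = newCachedReader(server)
	var direct Reader = server

	// The object changes on the server after the cache was populated.
	server.objects["amf-v1"] = "Proposed"

	c, _ := cached.Get("amf-v1")
	d, _ := direct.Get("amf-v1")
	fmt.Println("cached read:", c)
	fmt.Println("direct read:", d)
}
```

The trade-off is that direct reads add load on the API server for every call, so they are usually reserved for the few reads where staleness causes incorrect decisions.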