Improve approval reconciler timings #797

Merged: 3 commits merged into nephio-project:main on Aug 22, 2024

Contributor

@efiacor efiacor commented Aug 15, 2024

Change the approval controller's PackageRevision (PR) Get to hit the API server directly instead of reading from the local cache.
Adjust the requeue duration to prevent a race condition.

While debugging the approval delay issue reported here, it became apparent that the packagerev being fetched was a cached version that had not been updated for quite some time.
To circumvent this, we now retrieve the PR using the apiReader interface, which bypasses the local cache and hits the k8s API directly.
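
For readers unfamiliar with the cached versus uncached read paths, here is a minimal, standard-library-only sketch of the stale-read problem described above. It is not the controller's code: `PackageRevision`, `Reader`, `apiServer`, `cachedReader`, and `lifecycleOf` are hypothetical stand-ins. In controller-runtime, the uncached reader is the `client.Reader` returned by the manager's `GetAPIReader()`; how this PR wires it into the reconciler is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// PackageRevision is a hypothetical, trimmed-down stand-in for the Porch type.
type PackageRevision struct {
	Name      string
	Lifecycle string
}

// Reader is a simplified read-only client: fetch one object by name.
type Reader interface {
	Get(name string) (PackageRevision, error)
}

// apiServer plays the role of the source of truth (the k8s API).
type apiServer struct {
	objects map[string]PackageRevision
}

func (s *apiServer) Get(name string) (PackageRevision, error) {
	pr, ok := s.objects[name]
	if !ok {
		return PackageRevision{}, errors.New("not found: " + name)
	}
	return pr, nil
}

// cachedReader serves reads from a snapshot that is refreshed only by Sync,
// so it can lag behind the API server.
type cachedReader struct {
	snapshot map[string]PackageRevision
}

// Sync copies the server's current state into the snapshot.
func (c *cachedReader) Sync(s *apiServer) {
	c.snapshot = make(map[string]PackageRevision, len(s.objects))
	for k, v := range s.objects {
		c.snapshot[k] = v
	}
}

func (c *cachedReader) Get(name string) (PackageRevision, error) {
	pr, ok := c.snapshot[name]
	if !ok {
		return PackageRevision{}, errors.New("not in cache: " + name)
	}
	return pr, nil
}

// lifecycleOf reads through whichever Reader the caller injects.
func lifecycleOf(r Reader, name string) (string, error) {
	pr, err := r.Get(name)
	if err != nil {
		return "", err
	}
	return pr.Lifecycle, nil
}

func main() {
	server := &apiServer{objects: map[string]PackageRevision{
		"amf-v1": {Name: "amf-v1", Lifecycle: "Draft"},
	}}
	cache := &cachedReader{}
	cache.Sync(server)

	// The object changes on the server, but the cache has not synced yet.
	server.objects["amf-v1"] = PackageRevision{Name: "amf-v1", Lifecycle: "Proposed"}

	stale, _ := lifecycleOf(cache, "amf-v1")
	fresh, _ := lifecycleOf(server, "amf-v1")
	fmt.Println("cached:", stale, "direct:", fresh)
}
```

The trade-off in this sketch is the usual one: the direct read always sees the latest state but costs a round trip to the API server on every call, while the cached read is cheap but may return an outdated object.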

Contributor Author

efiacor commented Aug 15, 2024

#462

@@ -47,7 +47,7 @@ const (
 	DelayAnnotationName = "approval.nephio.org/delay"
 	PolicyAnnotationName = "approval.nephio.org/policy"
 	InitialPolicyAnnotationValue = "initial"
-	RequeueDuration = 10 * time.Second
+	RequeueDuration = 15 * time.Second

@efiacor What does it signify? Should this value be statically calculated?

How do we know that 15 is the right value?

Contributor Author

How would you propose that we statically calculate it?

I don't know how it works or, in the first place, what factors govern it.
You changed it to 15 instead of 10, so I thought you would know why exactly this value was chosen.

Contributor Author

I guess the original change was to address the removal of some watch events from Porch.
#452

With this change, the retrieval of the PR readiness gates has been sped up greatly, leading to another race condition I observed, especially in the 006 test (AMF/SMF deploy).

To address this, I added a delay to the said packagevariants, but I still noticed intermittent issues where the AMF especially would go from approved (v1) to being reprocessed, with a v2 branch getting created, thus failing the initial policy.

Bumping the requeue duration seems to allow time for the pkgrev mutation to happen without it being pushed on to a v2 revision.

Overall, the auto approve does need more refinement. Some of the watch events from Porch need to be looked at more closely, and this controller needs to be aligned with them.

This fix is mainly focused on reducing the excessive "wait" timings we are seeing in the CI runs, but also, to try to focus some more attention on this controller as it may be of more use to users than the free5gc specific ones here.

Contributor

"I noticed intermittent issues where the AMF especially would go from approved - v1, to being reprocessed and a v2 branch getting created"

Do you know if it was the approval controller that approved the packagerevision prematurely, or was it some other actor?

Member

The only other actor would be the test script itself or a human. Tweaking timings is a losing battle; the overall flow should be robust to race conditions. So I think using a different approval policy makes more sense.

Contributor Author

Still doing some debugging to determine the cause/source, but it seems the pv controller is triggering a new "clone" when it finds a diff in the KptFile.
https://github.com/nephio-project/porch/blob/main/controllers/packagevariants/pkg/controllers/packagevariant/packagevariant_controller.go#L827
This comes from the generic specializer controller, I believe.
Even when using an "always" policy, I am seeing some odd behaviour. Will continue with more refinements later.

Member

@johnbelamaric johnbelamaric left a comment

These changes look fine to me. However, it is clear that the "initial" policy isn't really meeting our needs. I suggest we add another policy to approval controller.

See: #398 - I will add another comment there.

Contributor Author

efiacor commented Aug 21, 2024

> These changes look fine to me. However, it is clear that the "initial" policy isn't really meeting our needs. I suggest we add another policy to approval controller.
>
> See: #398 - I will add another comment there.

Perfect. I will add a new "always" policy in the next PR.
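
As a rough illustration of how the two policies discussed in this thread could differ, here is a minimal sketch. It rests on assumptions: the "always" policy is only proposed here, so `AlwaysPolicyAnnotationValue` and `shouldApprove` are hypothetical, and the reading of "initial" as "approve only when no revision has been published yet" is inferred from the v1/v2 behaviour described earlier, not taken from the controller source.

```go
package main

import "fmt"

const (
	// InitialPolicyAnnotationValue appears in the diff under review.
	InitialPolicyAnnotationValue = "initial"
	// AlwaysPolicyAnnotationValue is hypothetical; the policy is only
	// proposed in this conversation.
	AlwaysPolicyAnnotationValue = "always"
)

// shouldApprove is a hypothetical policy check. hasPublishedRevision reports
// whether an earlier revision of the same package is already published.
func shouldApprove(policy string, hasPublishedRevision bool) (bool, error) {
	switch policy {
	case InitialPolicyAnnotationValue:
		// Assumed semantics: approve only the very first revision.
		return !hasPublishedRevision, nil
	case AlwaysPolicyAnnotationValue:
		// Assumed semantics: approve every proposed revision.
		return true, nil
	default:
		return false, fmt.Errorf("unknown approval policy %q", policy)
	}
}

func main() {
	for _, policy := range []string{InitialPolicyAnnotationValue, AlwaysPolicyAnnotationValue} {
		ok, err := shouldApprove(policy, true)
		fmt.Printf("policy=%s hasPublished=true approve=%t err=%v\n", policy, ok, err)
	}
}
```

Under these assumptions, a package that is reprocessed into a v2 revision would be rejected by "initial" but accepted by "always", which is the gap the proposed policy is meant to close.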

@johnbelamaric
Member

/approve
/lgtm

Contributor

nephio-prow bot commented Aug 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, kushnaidu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nephio-prow nephio-prow bot merged commit 8bb1efd into nephio-project:main Aug 22, 2024
11 checks passed
6 participants