Do not block config updates if we cannot set up the status subsystem #2005
Conversation
Force-pushed a03c65d to 458151d
Looks good, but I think that either unit testing or integration testing is important here.
I'd suggest we figure out the optimal (and cost-effective) testing approach and merge this once it's appropriately tested for the future.
👍
I think we would need to add some functionality to our metallb addon that would cause it to ignore certain services. I'm not aware off the top of my head of the configuration for that, but if it doesn't exist for some reason it would be reasonably straightforward to add upstream. In the short term, we could add an E2E-style test (or expand an existing E2E test), where turning off metallb is more practical because each test case gets its own cluster.
While you're thinking about refactors, I just wanted to point out #1492 for your consideration. I have thought for some time that we will eventually need to rewrite the entire status condition handling code. I would argue that rather than refactoring heavily, we should reimplement the status conditions as an aspect of the controllers themselves.
Good catch. I would like to see a follow-up issue for this so we tackle it separately.
Ignoring specific services doesn't really help, since this test is only relevant to the proxy service; may as well just turn it off entirely. I'm loath to have something that slow for what's essentially a single regression test, but 🤷 -- refactoring to allow for a convoluted test would be painful, especially for something we intend to rework. 1cf9bba attempts this but rapidly runs into two problems:
Yes, that would be the workflow. It would seem to call for the following:
Not yet, but I believe that would be a trivial
Back to draft pending Kong/kubernetes-testing-framework#149. The changes in 0e5d6f0 allow for E2E tests with custom images and for using NodePorts in the event that a LoadBalancer has no IPs (or the Service is explicitly a NodePort). It does not support automatically building the image from the current branch and loading that; you must build/tag the image yourself and specify it manually. IMO good enough to demonstrate the change's viability for now--image build automation would be necessary for CI, but would require a bunch of CI work. Manual local run results:
Force-pushed d6e0faa to 57bfa5b
Force-pushed 57bfa5b to caafc4e
The initial status update loop setup function, ctrlutils.PullConfigUpdate, sets up a channel select that receives ConfigDone events. The config update loop sends to this channel, and blocks if nothing can receive from it. Previously, a failure to build the client configuration necessary to update status would exit PullConfigUpdate, and a failure to retrieve the publish service address(es) would block PullConfigUpdate before it began receiving events; environments that never receive addresses (LoadBalancer Services in clusters that cannot provision LoadBalancers) would block indefinitely. If either of these occurred, the config update loop would run once, block, and never run again. This change avoids the deadlock by having PullConfigUpdate always begin its channel receive. The ConfigDone receiver attempts to initialize configuration and status information if they are not marked ready, and marks them ready if it succeeds. If both are ready, it updates status. If one or the other is not ready, it logs a debug-level message and does nothing.
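Roughly, the loop described above takes the following shape. This is only a sketch: the signature, the setupClients/setupAddresses/updateStatuses callbacks, and the logger handling are illustrative stand-ins, not the actual ctrlutils API.

```go
package status

import (
	"context"

	"github.com/go-logr/logr"
)

// pullConfigUpdate is an illustrative stand-in for ctrlutils.PullConfigUpdate.
// The key property is that it always keeps receiving from configDone: setup is
// retried lazily on each event instead of being a precondition for the loop.
func pullConfigUpdate(
	ctx context.Context,
	configDone <-chan struct{},
	log logr.Logger,
	setupClients, setupAddresses func() error, // hypothetical setup steps
	updateStatuses func() error, // hypothetical status update step
) {
	clientsReady, addressesReady := false, false
	for {
		select {
		case <-ctx.Done():
			return
		case <-configDone:
			if !clientsReady {
				if err := setupClients(); err != nil {
					log.V(1).Info("status clients not ready, skipping update", "error", err.Error())
					continue
				}
				clientsReady = true
			}
			if !addressesReady {
				if err := setupAddresses(); err != nil {
					log.V(1).Info("publish addresses not ready, skipping update", "error", err.Error())
					continue
				}
				addressesReady = true
			}
			if err := updateStatuses(); err != nil {
				log.Error(err, "failed to update resource statuses")
			}
		}
	}
}
```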
Add a test that uses a cluster without a load balancer. Add support for overriding the KIC image in E2E tests: load an arbitrary image into the E2E test cluster and use kustomize to replace that image in test manifests. Add support for NodePorts and no-IP LoadBalancers (via their NodePort) when attempting to verify an Ingress.
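A rough sketch of the verification fallback this describes, written only against the corev1 Service types. The helper name and the way a node address is supplied are assumptions for illustration, not the test suite's actual helpers.

```go
package e2e

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// proxyTargets is a hypothetical helper: it returns the host:port pairs a test
// could probe for a proxy Service, preferring LoadBalancer ingress addresses
// and falling back to NodePorts (paired with a known node address, e.g. a kind
// node IP) when the LoadBalancer has no IPs or the Service is a NodePort.
func proxyTargets(svc *corev1.Service, nodeAddr string) []string {
	var targets []string
	for _, ing := range svc.Status.LoadBalancer.Ingress {
		host := ing.IP
		if host == "" {
			host = ing.Hostname
		}
		if host == "" {
			continue
		}
		for _, port := range svc.Spec.Ports {
			targets = append(targets, fmt.Sprintf("%s:%d", host, port.Port))
		}
	}
	if len(targets) > 0 {
		return targets
	}
	// No LoadBalancer addresses: either the Service is explicitly a NodePort,
	// or it is a LoadBalancer in a cluster that cannot provision one.
	for _, port := range svc.Spec.Ports {
		if port.NodePort != 0 {
			targets = append(targets, fmt.Sprintf("%s:%d", nodeAddr, port.NodePort))
		}
	}
	return targets
}
```

A test can then probe each candidate target until one answers, which works both for clusters with a LoadBalancer provider and for the no-LB case added here.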
Updated with the actual KTF release and rebased onto main with the previous smaller changes squashed; now it's just the commit that fixes status handling plus the new E2E test.

The E2E tests normally only run on tags, with the expectation that the manifests in that tag will use an image built for that tag. That's true if you're doing a release (and it's what we want for releases, to test the actual artifacts), but not otherwise. To handle this, the test updates add a shim that checks whether you've specified an override image via the environment and, if so, rewrites the stock manifest to use that image, loads the image into the cluster, and deploys the modified manifest. Creating the image remains up to the user: this does not add CI logic for building a test image from the current branch and overriding the manifests to use it. You need to manually build images before you manually run E2E tests with an override. CI will eventually test the changes whenever they're tagged.

test.txt is the manual test run I did for this branch. Though not shown, TestDeployAllInOneDBLESSNoLoadBalancer will fail when run with any current released image.
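For reference, the manifest-rewriting shim amounts to something like the sketch below. The KIC_IMAGE_OVERRIDE variable name, the stock image name, and the helper itself are placeholders for illustration rather than the test suite's actual constants.

```go
package e2e

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// renderWithImageOverride is a hypothetical shim: if an override image is set
// in the environment, it wraps the stock manifest in a kustomization whose
// images transformer swaps in that image, then renders it with kubectl.
func renderWithImageOverride(manifestPath string) ([]byte, error) {
	override := os.Getenv("KIC_IMAGE_OVERRIDE") // placeholder variable name, "name:tag" form
	if override == "" {
		return os.ReadFile(manifestPath) // no override: deploy the stock manifest as-is
	}

	name, tag := override, "latest"
	if i := strings.LastIndex(override, ":"); i > 0 {
		name, tag = override[:i], override[i+1:]
	}

	dir, err := os.MkdirTemp("", "kic-e2e-kustomize-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	manifest, err := os.ReadFile(manifestPath)
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(dir, "manifest.yaml"), manifest, 0o600); err != nil {
		return nil, err
	}

	// The images transformer rewrites any container using the stock image name.
	kustomization := fmt.Sprintf(`resources:
- manifest.yaml
images:
- name: kong/kubernetes-ingress-controller
  newName: %s
  newTag: %s
`, name, tag)
	if err := os.WriteFile(filepath.Join(dir, "kustomization.yaml"), []byte(kustomization), 0o600); err != nil {
		return nil, err
	}

	return exec.Command("kubectl", "kustomize", dir).Output()
}
```

Rendering through kubectl kustomize keeps the stock manifest untouched; only the image reference changes before deployment.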
Co-authored-by: Michał Flendrich <michal@flendrich.pro>
Re-run following changes: tests.txt. The Istio result is garbage since I don't run an SSH agent, so I'm ignoring it. Re-running that separately, but I don't expect it to have issues since it's a separate codebase from the all-in-one tests--still in progress. Ed: what a noisy test: istio.txt
The integration tests have failed twice now. The test introduced in 12099e8#diff-23f82f96580f3a9696df4886f9539c8b88cf3a3b2297b71c197e417f9d24e7b3 appears to be flaky and unrelated to this change:
#2029 should deal with this, though it's a bit hard to prove. It's run in CI twice without reproducing the failure observed here.
* fix(status) do not block config updates on failure
* test(e2e) verify clusters without LB providers

Co-authored-by: Michał Flendrich <michal@flendrich.pro>
What this PR does / why we need it:
Moves the status subsystem setup (provisioning clients and retrieving publish service addresses) inside the channel receiver that handles updates from the proxy, and adds readiness flags to indicate whether setup has completed successfully. If setup is not marked ready, the handler attempts setup and records the result. If setup is still not ready, the status update handler does nothing; if setup is ready, it processes status updates.
This avoids an issue where config updates deadlocked after the first loop. The update loop attempts to send updates through the ConfigDone channel that the status loop (should) receive. Prior to this change, a failure to initialize the status subsystem meant the status loop would never receive, and the update loop would block indefinitely.
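To make the failure mode concrete, here is the general Go pattern reduced to a toy program (not the controller's actual code): once the receiving goroutine exits on a setup error, the next send on the unbuffered channel blocks forever, and the Go runtime reports a deadlock.

```go
package main

import (
	"errors"
	"fmt"
)

func setupStatusClients() error {
	// Stands in for building the client configuration; in the broken clusters
	// described above, this (or address retrieval) never succeeds.
	return errors.New("cannot build client configuration")
}

func main() {
	configDone := make(chan struct{}) // unbuffered, like ConfigDone

	// Stand-in for the old status loop: if setup fails, it returns before it
	// ever starts receiving, so nothing will ever drain configDone.
	go func() {
		if err := setupStatusClients(); err != nil {
			return
		}
		for range configDone {
			// would update resource statuses here
		}
	}()

	fmt.Println("first config update complete, notifying the status loop...")
	configDone <- struct{}{} // blocks forever: no further config updates would ever run
}
```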
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #2001

Special notes for your reviewer:

RunningAddresses() was public but had no apparent reason to be. It now returns a private type, so I made it private.

Unit testing this appears to require that we isolate newStatusConfig() and runningAddresses(), mock those to always fail, and then do something convoluted with the channel--I suppose you could run PullConfigUpdate() with mocked always-fail versions, have the test send to the channel with a short timeout, and fail if it exceeds the timeout?
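A sketch of that timeout-based test, reusing the pullConfigUpdate stand-in sketched earlier; the signature and the mocked always-fail callbacks are assumptions, not the current ctrlutils API.

```go
package status

import (
	"context"
	"errors"
	"testing"
	"time"

	"github.com/go-logr/logr"
)

// TestPullConfigUpdateDoesNotBlockOnSetupFailure mocks every setup step to
// fail and asserts that sends into configDone never block for long: the status
// loop must keep receiving even though it can never become ready.
func TestPullConfigUpdateDoesNotBlockOnSetupFailure(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	configDone := make(chan struct{}) // unbuffered, like the real channel
	alwaysFail := func() error { return errors.New("setup failure") }

	go pullConfigUpdate(ctx, configDone, logr.Discard(), alwaysFail, alwaysFail, alwaysFail)

	for i := 0; i < 3; i++ {
		select {
		case configDone <- struct{}{}:
			// still receiving: the config update loop would not deadlock
		case <-time.After(time.Second):
			t.Fatalf("send %d blocked: the status loop stopped receiving", i)
		}
	}
}
```

Against the pre-fix behavior (the loop exiting or blocking on setup failure), the first or second send would hit the timeout and fail the test.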
PR Readiness Checklist:

Complete these before marking the PR as ready to review:

- CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR