
Do not block config updates if we cannot set up the status subsystem #2005

Merged
rainest merged 9 commits into main from fix/status-noblock on Dec 1, 2021

Conversation

@rainest (Contributor) commented Nov 9, 2021

What this PR does / why we need it:
Moves the status subsystem setup (provisioning clients and retrieving publish service addresses) inside the channel receiver that handles updates from the proxy, and adds readiness flags to indicate whether setup has completed successfully. If setup is marked not ready, the handler attempts setup and records the result. If setup is still not ready, the status update handler does nothing. If setup is ready, it processes status updates.

This avoids an issue where config updates deadlocked after the first loop. The update loop sends updates through the ConfigDone channel that the status loop should receive from. Prior to this change, a failure to initialize the status subsystem meant the status loop would never receive, and the update loop would block indefinitely.
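For illustration, here is a minimal Go sketch of the pattern described above. It loosely mirrors what ctrlutils.PullConfigUpdate does after this change, but every identifier and detail here (trySetup, statusUpdater, the logging) is a simplified stand-in, not the PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// statusUpdater stands in for the real status subsystem client; the name is hypothetical.
type statusUpdater struct{ addresses []string }

func (s *statusUpdater) update() { log.Printf("updating status with addresses %v", s.addresses) }

// trySetup stands in for building clients and resolving the publish service addresses.
// In a cluster that never provisions the LoadBalancer it keeps failing.
func trySetup(ctx context.Context) (*statusUpdater, error) {
	return nil, errors.New("publish service has no addresses yet")
}

// pullConfigUpdate always receives from configDone; when setup is not ready it
// retries setup and drops the event instead of leaving the sender blocked.
func pullConfigUpdate(ctx context.Context, configDone <-chan struct{}) {
	var updater *statusUpdater
	ready := false
	for {
		select {
		case <-ctx.Done():
			return
		case <-configDone:
			if !ready {
				u, err := trySetup(ctx)
				if err != nil {
					log.Printf("status subsystem not ready, skipping update: %v", err)
					continue
				}
				updater, ready = u, true
			}
			updater.update()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	configDone := make(chan struct{})
	go pullConfigUpdate(ctx, configDone)
	// The config update loop can keep sending even though setup never succeeds.
	for i := 0; i < 3; i++ {
		configDone <- struct{}{}
	}
}
```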

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #2001

Special notes for your reviewer:

  • RunningAddresses() was public but had no apparent reason to be. It now returns a private type, so I made it private.
  • Integration tests for this would require an environment that cannot provision LoadBalancers. I think KTF can do this, but it requires a separate test step (or at least something separate; it's mutually exclusive with our current environment).
  • Unit tests are maybe doable, but I think you'd need to refactor the status subsystem to define an interface for at least newStatusConfig() and runningAddresses(), mock those to always fail, and then do something convoluted with the channel: run PullConfigUpdate() with the always-failing mocks, have the test send to the channel with a short timeout, and fail if the send exceeds the timeout (a rough version of this is sketched after this list).
  • While writing this I also confirmed how we handle changes to the publish service in the status subsystem outside the controller reconcilers (where we will, in the future, handle updates to the GWAPI equivalent of publish services). It turns out that we don't handle this at all: we retrieve the publish services during init, save them, and never update them again as far as I can tell (though we do have logic to update Ingress-likes if the addresses change!). This PR does not seek to address that, though it's presumably something we want to handle properly. Handling it here would seem to require fetching the publish service on every update, which seems inefficient. I'm curious whether it would make sense to move that into a watch somewhere and/or have a controller that handles those updates for both GWAPI and Ingress.
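A rough sketch of the unit test idea from the second bullet above, assuming the status subsystem were refactored behind a small interface; the interface, the mock, and the simplified pullConfigUpdate here are hypothetical, not code from this repository:

```go
package ctrlutils_test

import (
	"context"
	"errors"
	"testing"
	"time"
)

// statusSetup is a hypothetical interface such a refactor might introduce.
type statusSetup interface {
	Setup(ctx context.Context) error
}

// alwaysFailSetup mocks a cluster where the publish service never gets addresses.
type alwaysFailSetup struct{}

func (alwaysFailSetup) Setup(context.Context) error { return errors.New("no publish addresses") }

// pullConfigUpdate is a stand-in for the refactored entry point: it must keep
// receiving from configDone even when setup never succeeds.
func pullConfigUpdate(ctx context.Context, s statusSetup, configDone <-chan struct{}) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-configDone:
			if err := s.Setup(ctx); err != nil {
				continue // not ready yet; drop the event rather than block the sender
			}
		}
	}
}

func TestConfigDoneSendDoesNotBlockWhenSetupFails(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	configDone := make(chan struct{})
	go pullConfigUpdate(ctx, alwaysFailSetup{}, configDone)

	select {
	case configDone <- struct{}{}:
		// The receiver accepted the event even though setup failed.
	case <-time.After(time.Second):
		t.Fatal("sending to ConfigDone blocked; the status receiver is not draining the channel")
	}
}
```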

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

@rainest rainest requested a review from a team as a code owner November 9, 2021 23:51
@rainest rainest force-pushed the fix/status-noblock branch from a03c65d to 458151d on November 9, 2021 23:53
@mflendrich (Contributor) left a comment
Looks good but I think that either unit testing or integration testing is important here.

I'd suggest that we figure out the optimal (and cost-effective) testing approach and merge this once we have it appropriately tested, to guard against regressions in the future.

internal/ctrlutils/ingress-status.go — review thread (outdated, resolved)
@shaneutt (Contributor)

> RunningAddresses() was public but had no apparent reason to be. It now returns a private type, so I made it private.

👍

> Integration tests for this would require testing in an environment that cannot provision LBs. I think KTF can do this, but it does require a separate test step (I think--something separate anyway, it's mutually exclusive with our current environment).

I think we would need to add some functionality to our metallb addon that would cause it to ignore certain services. I'm not aware off the top of my head of the configuration for that, but if it doesn't exist for some reason it would be reasonably straightforward to add upstream.

In the short term, we could add an E2E style test (or expand an existing E2E test) where turning off metallb is more practical because each test case gets its own cluster.

> Unit tests are maybe doable, but I think you'd need to refactor the status subsystem to define an interface for at least newStatusConfig() and runningAddresses(), mock those to always fail, and then do something convoluted with the channel--I suppose you could run PullConfigUpdate() with mocked always fail versions, have the test send to the channel with a short timeout, and fail if it exceeds the timeout?

While you're thinking about refactors I just wanted to point out #1492 for your consideration. I have thought for some time that we will need to rewrite the entire status condition handling code. I would argue that rather than refactoring heavily we should reimplement the status conditions as an aspect of the controllers themselves.

> While writing this I also confirmed how we were handling changes to the publish service in the status subsystem outside the controller reconcilers (where we will, in the future, handle updates to the GWAPI equivalent of publish services). It turns out that we don't handle this at all: we retrieve the publish services during init, save them, and then never update them again as far as I can tell (though we do have logic to update Ingress-likes if the addresses change!). This PR does not seek to address that, though it's presumably something we want to handle properly.

Good catch. I would like to see a follow-up issue for this so we can tackle it separately.

@rainest (Contributor, Author) commented Nov 10, 2021

> I think we would need to add some functionality to our metallb addon that would cause it to ignore certain services ... In the short term, we could add an E2E style test (or expand an existing E2E test) where turning off metallb is more practical because each test case gets its own cluster.

Ignoring specific services doesn't really help since this test is only relevant to the proxy service; may as well just turn metallb off entirely. I'm loath to have something that slow for what's essentially a single regression test, but 🤷 refactoring to allow for a convoluted unit test would be painful, especially for something we intend to rework.

1cf9bba attempts this but rapidly runs into two problems:

  • E2E tests run against manifests, and by extension against existing images. Is there an existing way to trial them against a branch? I'm guessing you would need to build an image, load it into KIND, and then patch the manifests to use the temporary image?
  • The tests rely on the LB IP to test behavior, and this removes that. Does KTF have a way to expose the node IP for NodePorts instead? https://github.com/gianarb/kube-port-forward exists as an alternative, but I'd prefer to avoid using a one-off PoC.

@shaneutt (Contributor)

> E2E tests against manifests and by extension existing images. Is there an existing way to trial them against a branch? I'm guessing you would need to build an image, load it into KIND, and then patch the manifests to use the temporary image?

Yes, that would be the workflow. It would seem to call for the following:

  • the kind.Builder should have a WithContainerImageFiles(files ...string) option to identify image archives and load them prior to loading addons.
  • the E2E tests for this should deploy using the kustomize manifests instead of the all-in-one manifest, to allow image specification

> The tests rely on the LB IP to test behavior, and this removes that. Does KTF have a way to expose the node IP for NodePorts instead? https://github.com/gianarb/kube-port-forward exists as an alternative, but I'd prefer to avoid using a one-off PoC.

Not yet, but I believe that would be a trivial WithProxyServiceNodePort() option to add to KTF and wire up to the Helm values underneath.

@rainest rainest marked this pull request as draft November 12, 2021 01:19
@rainest (Contributor, Author) commented Nov 12, 2021

Back to draft pending Kong/kubernetes-testing-framework#149

The changes in 0e5d6f0 allow for E2E tests with custom images and use NodePorts in the event that a LoadBalancer has no IPs (or if the Service is explicitly a NodePort). They do not support automatically building the image from the current branch and loading that; you must build/tag the image yourself and specify it manually. IMO that's good enough to demonstrate the viability of the change for now: image build automation would be necessary for CI, but would require a fair amount of CI work. Manual local run results:

17:06:19-0800 esenin $ TEST_KONG_CONTROLLER_IMAGE_OVERRIDE="kictest:2.0.5-test" GOFLAGS="-tags=e2e_tests" go test -v \
        -race \
        -parallel 8 \
        -timeout 30m \
        ./test/e2e/... -test.run TestDeployAllInOneDBLESSNoLoadBalancer
17:06:43-0800 esenin $ TEST_KONG_CONTROLLER_IMAGE_OVERRIDE="kictest:2.0.5-test" GOFLAGS="-tags=e2e_tests" go test -v \
        -race \
        -parallel 8 \
        -timeout 30m \
        ./test/e2e/... -test.run TestDeployAllInOneDBLESSNoLoadBalancer
=== RUN   TestDeployAllInOneDBLESSNoLoadBalancer
    all_in_one_test.go:106: configuring all-in-one-dbless.yaml manifest test
=== PAUSE TestDeployAllInOneDBLESSNoLoadBalancer
=== CONT  TestDeployAllInOneDBLESSNoLoadBalancer
    all_in_one_test.go:111: building test cluster and environment
    all_in_one_test.go:125: deploying kong components
    all_in_one_test.go:556: using modified ../../deploy/single/all-in-one-dbless.yaml manifest at /tmp/kictest.909394191
    all_in_one_test.go:269: creating a tempfile for kubeconfig
    all_in_one_test.go:277: dumping kubeconfig to tempfile
    all_in_one_test.go:282: waiting for testing environment to be ready
    all_in_one_test.go:285: creating the kong namespace
    all_in_one_test.go:293: deploying any supplemental secrets (found: 0)
    all_in_one_test.go:299: deploying the /tmp/kictest.909394191 manifest to the cluster
    all_in_one_test.go:306: waiting for kong to be ready
    all_in_one_test.go:128: running ingress tests to verify all-in-one deployed ingress controller and proxy are functional
    all_in_one_test.go:315: deploying an HTTP service to test the ingress controller and proxy
    all_in_one_test.go:321: exposing deployment httpbin via service
    all_in_one_test.go:326: creating an ingress for service httpbin with ingress.class kong
    all_in_one_test.go:335: finding the kong proxy service ip
    all_in_one_test.go:359: waiting for route from Ingress to be operational at http://172.18.0.2:30620/httpbin
--- PASS: TestDeployAllInOneDBLESSNoLoadBalancer (182.86s)

@rainest rainest force-pushed the fix/status-noblock branch 2 times, most recently from d6e0faa to 57bfa5b on November 19, 2021 23:02
@rainest rainest marked this pull request as ready for review November 19, 2021 23:08
@rainest rainest requested a review from mflendrich November 19, 2021 23:08
Travis Raines added 2 commits November 19, 2021 15:09
The initial status update loop setup function,
ctrlutils.PullConfigUpdate, sets up a channel select that receives
ConfigDone events. The config update loop sends to this channel, and
blocks if nothing can receive from it.

Previously, a failure to build the client configuration necessary to
update status would exit PullConfigUpdate. Failure to retrieve the
publish service address(es) would block PullConfigUpdate before it began
receiving events, and environments that never receive addresses
(LoadBalancer Services in clusters that cannot provision LoadBalancers)
would block indefinitely. If either of these occurred, the config update
loop would run once, block, and never run again.

This change avoids this deadlock by having PullConfigUpdate always begin
its channel receive. The ConfigDone receiver attempts to initialize
configuration and status information if they are not marked ready, and
marks them ready if it succeeds. If both are ready, it updates status.
If one or the other is not ready, it logs a debug-level message and does
nothing.
Add a test that uses a cluster without a load balancer.

Add support for overriding the KIC image in E2E tests. Loads an
arbitrary image into the E2E test cluster and uses kustomize to replace
that image in test manifests.

Add support for NodePorts and no-IP LoadBalancers (via their NodePort)
when attempting to verify an Ingress.
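For illustration, here is a minimal Go sketch of the LB-or-NodePort fallback this commit message describes. The proxyURL helper, the port-80 proxy listener, and the "proxy" port name are assumptions made for the sketch, not the test suite's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// proxyURL prefers a provisioned LoadBalancer IP and falls back to a node
// address plus the Service's NodePort when no LB IP ever appears.
func proxyURL(svc *corev1.Service, node *corev1.Node) (string, error) {
	for _, ing := range svc.Status.LoadBalancer.Ingress {
		if ing.IP != "" {
			return fmt.Sprintf("http://%s:80", ing.IP), nil // port 80 proxy listener is an assumption
		}
	}
	var nodeIP string
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			nodeIP = addr.Address
			break
		}
	}
	if nodeIP == "" {
		return "", fmt.Errorf("node %s has no internal IP", node.Name)
	}
	for _, port := range svc.Spec.Ports {
		if port.Name == "proxy" && port.NodePort != 0 { // the "proxy" port name is an assumption
			return fmt.Sprintf("http://%s:%d", nodeIP, port.NodePort), nil
		}
	}
	return "", fmt.Errorf("service %s exposes no proxy NodePort", svc.Name)
}

func main() {
	// A real test would pass the kong-proxy Service and any cluster node
	// fetched via client-go; omitted to keep the sketch focused on selection.
	fmt.Println("proxyURL picks an LB IP when present, otherwise nodeIP:NodePort")
}
```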
@rainest (Contributor, Author) commented Nov 19, 2021

Updated with the actual KTF release and rebased onto main, with the previous smaller changes squashed. It's now just the commit that fixes status handling plus the new E2E test.

The E2E tests normally only run on tags, with the expectation that the manifests in that tag will use an image built for that tag. That's true if you're doing a release (and it's what we want for releases, to test the actual artifacts), but not otherwise. To handle this, the test updates add a shim that checks whether you've specified an override via the environment, rewrites the stock manifest to use that image, loads the image, and deploys the modified manifest.
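A sketch of roughly how such a shim can work, using the TEST_KONG_CONTROLLER_IMAGE_OVERRIDE variable shown in the earlier manual run and kubectl's built-in kustomize rendering. The helper name and the kong/kubernetes-ingress-controller image name are assumptions; the real test code differs in detail:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// renderWithImageOverride returns the manifest to deploy: the stock one when no
// override is set, otherwise a copy patched to use the override image.
func renderWithImageOverride(manifestPath string) (string, error) {
	override := os.Getenv("TEST_KONG_CONTROLLER_IMAGE_OVERRIDE")
	if override == "" {
		return manifestPath, nil // no override: deploy the stock manifest as-is
	}
	name, tag := override, "latest"
	if i := strings.LastIndex(override, ":"); i > 0 {
		name, tag = override[:i], override[i+1:]
	}

	dir, err := os.MkdirTemp("", "kictest.")
	if err != nil {
		return "", err
	}
	// Copy the stock manifest next to the kustomization so kustomize can load it.
	manifest, err := os.ReadFile(manifestPath)
	if err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "manifest.yaml"), manifest, 0o600); err != nil {
		return "", err
	}
	// Assumed image name; the stock manifests use the upstream controller image.
	kustomization := fmt.Sprintf(`resources:
- manifest.yaml
images:
- name: kong/kubernetes-ingress-controller
  newName: %s
  newTag: %q
`, name, tag)
	if err := os.WriteFile(filepath.Join(dir, "kustomization.yaml"), []byte(kustomization), 0o600); err != nil {
		return "", err
	}

	// Render the patched manifest; the test then loads the override image into
	// the KIND cluster and applies this file instead of the original.
	out, err := exec.Command("kubectl", "kustomize", dir).Output()
	if err != nil {
		return "", err
	}
	patched := filepath.Join(dir, "manifest-patched.yaml")
	return patched, os.WriteFile(patched, out, 0o600)
}

func main() {
	path, err := renderWithImageOverride("../../deploy/single/all-in-one-dbless.yaml")
	if err != nil {
		panic(err)
	}
	fmt.Println("deploying manifest:", path)
}
```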

Creating the image remains up to the user: this does not add CI logic for building a test image from the current branch and overriding to that. You need to manually build images before you manually run E2E tests with an override. CI will eventually test the changes whenever they're tagged. test.txt is the manual test I ran for this branch. Though not shown, TestDeployAllInOneDBLESSNoLoadBalancer will fail when run with any current released image.

Travis Raines and others added 2 commits November 22, 2021 08:57
Co-authored-by: Michał Flendrich <michal@flendrich.pro>
@rainest (Contributor, Author) commented Nov 22, 2021

Re-ran following the changes. The Istio result is garbage since I don't run an SSH agent; ignoring it (tests.txt). I'm re-running that separately, but I don't expect it to have issues since it's a separate codebase from the all-in-one tests--still in e2e, but it doesn't use the deployKong() helper.

Edit: what a noisy test (istio.txt).

@rainest (Contributor, Author) commented Nov 22, 2021

The integration tests have failed twice now. The test introduced in 12099e8#diff-23f82f96580f3a9696df4886f9539c8b88cf3a3b2297b71c197e417f9d24e7b3 appears to be flaky and unrelated to this change:

  • The test is new.
  • The tested component (the admission server) behaves identically for both Postgres and DB-less systems. It failed once on DB-less (which prevented Postgres from running) and succeeded once on DB-less, but then failed in Postgres on that run.
  • The failure is due to a missing resource in the Kubernetes API store. I suspect we just need to wait; it's possible there's some lag between the credential creation succeeding and the credential actually being available (a retry along the lines of the sketch below).
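For reference, a sketch of what the "just wait" approach can look like using testify's Eventually; the getCredential helper and the test itself are hypothetical, not the actual fix that landed in #2029:

```go
package admission_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

// getCredential stands in for fetching the just-created credential via the
// Kubernetes API; it is hypothetical, not a helper in this repository.
func getCredential(ctx context.Context, name string) (bool, error) {
	return true, nil
}

func TestCredentialEventuallyVisible(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Poll instead of asserting immediately, to absorb any lag between the
	// create call returning and the object showing up in the API store.
	assert.Eventually(t, func() bool {
		found, err := getCredential(ctx, "test-credential")
		return err == nil && found
	}, time.Minute, time.Second)
}
```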

#2029 should deal with this, though it's a bit hard to prove. It's run in CI twice without reproducing the failure observed here.

@rainest rainest enabled auto-merge (squash) December 1, 2021 16:19
@rainest rainest merged commit 9b33faa into main Dec 1, 2021
@rainest rainest deleted the fix/status-noblock branch December 1, 2021 16:32
rainest pushed a commit that referenced this pull request Jan 18, 2022
* fix(status) do not block config updates on failure

The initial status update loop setup function,
ctrlutils.PullConfigUpdate, sets up a channel select that receives
ConfigDone events. The config update loop sends to this channel, and
blocks if nothing can receive from it.

Previously, a failure to build the client configuration necessary to
update status would exit PullConfigUpdate. Failure to retrieve the
publish service address(es) would block PullConfigUpdate before it began
receiving events, and environments that never receive addresses
(LoadBalancer Services in clusters that cannot provision LoadBalancers)
would block indefinitely. If either of these occurred, the config update
loop would run once, block, and never run again.

This change avoids this deadlock by having PullConfigUpdate always begin
its channel receive. The ConfigDone receiver attempts to initialize
configuration and status information if they are not marked ready, and
marks them ready if it succeeds. If both are ready, it updates status.
If one or the other is not ready, it logs a debug-level message and does
nothing.

* test(e2e) verify clusters without LB providers

Add a test that uses a cluster without a load balancer.

Add support for overriding the KIC image in E2E tests. Loads an
arbitrary image into the E2E test cluster and uses kustomize to replace
that image in test manifests.

Add support for NodePorts and no-IP LoadBalancers (via their NodePort)
when attempting to verify an Ingress.

Co-authored-by: Michał Flendrich <michal@flendrich.pro>

Successfully merging this pull request may close these issues.

Never-provisioned LoadBalancer blocks controller operation