
Do not block config updates if we cannot set up the status subsystem #2005

Merged
rainest merged 9 commits into main from fix/status-noblock on Dec 1, 2021

Conversation

@rainest (Contributor) commented Nov 9, 2021

What this PR does / why we need it:
Moves the status subsystem setup (provisioning clients and retrieving publish service addresses) inside the channel receiver that handles updates from the proxy, and adds readiness flags to indicate whether setup has completed successfully. If setup is marked not ready, the handler attempts setup and records the result. If setup is still not ready, the status update handler does nothing. If setup is ready, it processes status updates.

This avoids an issue where config updates deadlocked after the first loop. The update loop sends updates through the ConfigDone channel that the status loop should receive from. Prior to this change, a failure to initialize the status subsystem meant the status loop would never receive, and the update loop would block indefinitely.
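For illustration, here is a minimal Go sketch of the pattern described above. It loosely mirrors what ctrlutils.PullConfigUpdate does after this change, but every identifier and detail here (trySetup, statusUpdater, the logging) is a simplified stand-in, not the PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// statusUpdater stands in for the real status subsystem client; the name is hypothetical.
type statusUpdater struct{ addresses []string }

func (s *statusUpdater) update() { log.Printf("updating status with addresses %v", s.addresses) }

// trySetup stands in for building clients and resolving the publish service addresses.
// In a cluster that never provisions the LoadBalancer it keeps failing.
func trySetup(ctx context.Context) (*statusUpdater, error) {
	return nil, errors.New("publish service has no addresses yet")
}

// pullConfigUpdate always receives from configDone; when setup is not ready it
// retries setup and drops the event instead of leaving the sender blocked.
func pullConfigUpdate(ctx context.Context, configDone <-chan struct{}) {
	var updater *statusUpdater
	ready := false
	for {
		select {
		case <-ctx.Done():
			return
		case <-configDone:
			if !ready {
				u, err := trySetup(ctx)
				if err != nil {
					log.Printf("status subsystem not ready, skipping update: %v", err)
					continue
				}
				updater, ready = u, true
			}
			updater.update()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	configDone := make(chan struct{})
	go pullConfigUpdate(ctx, configDone)
	// The config update loop can keep sending even though setup never succeeds.
	for i := 0; i < 3; i++ {
		configDone <- struct{}{}
	}
}
```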

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #2001

Special notes for your reviewer:

  • RunningAddresses() was public but had no apparent reason to be. It now returns a private type, so I made it private.
  • Integration tests for this would require an environment that cannot provision LoadBalancers. I think KTF can do this, but it requires a separate test step (or at least something separate; it's mutually exclusive with our current environment).
  • Unit tests are maybe doable, but I think you'd need to refactor the status subsystem to define an interface for at least newStatusConfig() and runningAddresses(), mock those to always fail, and then do something convoluted with the channel: run PullConfigUpdate() with the always-failing mocks, have the test send to the channel with a short timeout, and fail if the send exceeds the timeout (a rough version of this is sketched after this list).
  • While writing this I also confirmed how we handle changes to the publish service in the status subsystem outside the controller reconcilers (where we will, in the future, handle updates to the GWAPI equivalent of publish services). It turns out that we don't handle this at all: we retrieve the publish services during init, save them, and never update them again as far as I can tell (though we do have logic to update Ingress-likes if the addresses change!). This PR does not seek to address that, though it's presumably something we want to handle properly. Handling it here would seem to require fetching the publish service on every update, which seems inefficient. I'm curious whether it would make sense to move that into a watch somewhere and/or have a controller that handles those updates for both GWAPI and Ingress.
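A rough sketch of the unit test idea from the second bullet above, assuming the status subsystem were refactored behind a small interface; the interface, the mock, and the simplified pullConfigUpdate here are hypothetical, not code from this repository:

```go
package ctrlutils_test

import (
	"context"
	"errors"
	"testing"
	"time"
)

// statusSetup is a hypothetical interface such a refactor might introduce.
type statusSetup interface {
	Setup(ctx context.Context) error
}

// alwaysFailSetup mocks a cluster where the publish service never gets addresses.
type alwaysFailSetup struct{}

func (alwaysFailSetup) Setup(context.Context) error { return errors.New("no publish addresses") }

// pullConfigUpdate is a stand-in for the refactored entry point: it must keep
// receiving from configDone even when setup never succeeds.
func pullConfigUpdate(ctx context.Context, s statusSetup, configDone <-chan struct{}) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-configDone:
			if err := s.Setup(ctx); err != nil {
				continue // not ready yet; drop the event rather than block the sender
			}
		}
	}
}

func TestConfigDoneSendDoesNotBlockWhenSetupFails(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	configDone := make(chan struct{})
	go pullConfigUpdate(ctx, alwaysFailSetup{}, configDone)

	select {
	case configDone <- struct{}{}:
		// The receiver accepted the event even though setup failed.
	case <-time.After(time.Second):
		t.Fatal("sending to ConfigDone blocked; the status receiver is not draining the channel")
	}
}
```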

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

@rainest rainest requested a review from a team as a code owner November 9, 2021 23:51
@rainest rainest force-pushed the fix/status-noblock branch from a03c65d to 458151d on November 9, 2021 23:53
@mflendrich (Contributor) left a comment
Looks good but I think that either unit testing or integration testing is important here.

I'd suggest that we figure out the optimal (and cost-effective) testing approach and merge this once we have it appropriately tested, to guard against regressions in the future.

internal/ctrlutils/ingress-status.go — review thread (outdated, resolved)
@shaneutt (Contributor)

> RunningAddresses() was public but had no apparent reason to be. It now returns a private type, so I made it private.

👍

> Integration tests for this would require testing in an environment that cannot provision LBs. I think KTF can do this, but it does require a separate test step (I think--something separate anyway, it's mutually exclusive with our current environment).

I think we would need to add some functionality to our metallb addon that would cause it to ignore certain services. I'm not aware off the top of my head of the configuration for that, but if it doesn't exist for some reason it would be reasonably straightforward to add upstream.

In the short term, we could add an E2E style test (or expand an existing E2E test) where turning off metallb is more practical because each test case gets its own cluster.

> Unit tests are maybe doable, but I think you'd need to refactor the status subsystem to define an interface for at least newStatusConfig() and runningAddresses(), mock those to always fail, and then do something convoluted with the channel--I suppose you could run PullConfigUpdate() with mocked always fail versions, have the test send to the channel with a short timeout, and fail if it exceeds the timeout?

While you're thinking about refactors I just wanted to point out #1492 for your consideration. I have thought for some time that we will need to rewrite the entire status condition handling code. I would argue that rather than refactoring heavily we should reimplement the status conditions as an aspect of the controllers themselves.

> While writing this I also confirmed how we were handling changes to the publish service in the status subsystem outside the controller reconcilers (where we will, in the future, handle updates to the GWAPI equivalent of publish services). It turns out that we don't handle this at all: we retrieve the publish services during init, save them, and then never update them again as far as I can tell (though we do have logic to update Ingress-likes if the addresses change!). This PR does not seek to address that, though it's presumably something we want to handle properly.

Good catch. I would like to see a follow-up issue for this so we can tackle it separately.

@rainest (Contributor, Author) commented Nov 10, 2021

> I think we would need to add some functionality to our metallb addon that would cause it to ignore certain services ... In the short term, we could add an E2E style test (or expand an existing E2E test) where turning off metallb is more practical because each test case gets its own cluster.

Ignoring specific services doesn't really help since this test is only relevant to the proxy service; may as well just turn metallb off entirely. I'm loath to have something that slow for what's essentially a single regression test, but 🤷 refactoring to allow for a convoluted unit test would be painful, especially for something we intend to rework.

1cf9bba attempts this but rapidly runs into two problems:

  • E2E tests run against manifests, and by extension against existing images. Is there an existing way to trial them against a branch? I'm guessing you would need to build an image, load it into KIND, and then patch the manifests to use the temporary image?
  • The tests rely on the LB IP to test behavior, and this removes that. Does KTF have a way to expose the node IP for NodePorts instead? https://github.com/gianarb/kube-port-forward exists as an alternative, but I'd prefer to avoid using a one-off PoC.

@shaneutt (Contributor)

> E2E tests against manifests and by extension existing images. Is there an existing way to trial them against a branch? I'm guessing you would need to build an image, load it into KIND, and then patch the manifests to use the temporary image?

Yes, that would be the workflow. It would seem to call for the following:

  • the kind.Builder should have a WithContainerImageFiles(files ...string) option to identify image archives and load them prior to loading addons.
  • the E2E tests for this should deploy using the kustomize manifests instead of the all-in-one manifest, to allow image specification

> The tests rely on the LB IP to test behavior, and this removes that. Does KTF have a way to expose the node IP for NodePorts instead? https://github.com/gianarb/kube-port-forward exists as an alternative, but I'd prefer to avoid using a one-off PoC.

Not yet, but I believe that would be a trivial WithProxyServiceNodePort() option to add to KTF and wire up to the Helm values underneath.

@rainest rainest marked this pull request as draft November 12, 2021 01:19
@rainest (Contributor, Author) commented Nov 12, 2021

Back to draft pending Kong/kubernetes-testing-framework#149

The changes in 0e5d6f0 allow for E2E tests with custom images and use NodePorts in the event that a LoadBalancer has no IPs (or if the Service is explicitly a NodePort). They do not support automatically building the image from the current branch and loading that; you must build/tag the image yourself and specify it manually. IMO that's good enough to demonstrate the viability of the change for now: image build automation would be necessary for CI, but would require a fair amount of CI work. Manual local run results:

17:06:19-0800 esenin $ TEST_KONG_CONTROLLER_IMAGE_OVERRIDE="kictest:2.0.5-test" GOFLAGS="-tags=e2e_tests" go test -v \
        -race \
        -parallel 8 \
        -timeout 30m \
        ./test/e2e/... -test.run TestDeployAllInOneDBLESSNoLoadBalancer
17:06:43-0800 esenin $ TEST_KONG_CONTROLLER_IMAGE_OVERRIDE="kictest:2.0.5-test" GOFLAGS="-tags=e2e_tests" go test -v \
        -race \
        -parallel 8 \
        -timeout 30m \
        ./test/e2e/... -test.run TestDeployAllInOneDBLESSNoLoadBalancer
=== RUN   TestDeployAllInOneDBLESSNoLoadBalancer
    all_in_one_test.go:106: configuring all-in-one-dbless.yaml manifest test
=== PAUSE TestDeployAllInOneDBLESSNoLoadBalancer
=== CONT  TestDeployAllInOneDBLESSNoLoadBalancer
    all_in_one_test.go:111: building test cluster and environment
    all_in_one_test.go:125: deploying kong components
    all_in_one_test.go:556: using modified ../../deploy/single/all-in-one-dbless.yaml manifest at /tmp/kictest.909394191
    all_in_one_test.go:269: creating a tempfile for kubeconfig
    all_in_one_test.go:277: dumping kubeconfig to tempfile
    all_in_one_test.go:282: waiting for testing environment to be ready
    all_in_one_test.go:285: creating the kong namespace
    all_in_one_test.go:293: deploying any supplemental secrets (found: 0)
    all_in_one_test.go:299: deploying the /tmp/kictest.909394191 manifest to the cluster
    all_in_one_test.go:306: waiting for kong to be ready
    all_in_one_test.go:128: running ingress tests to verify all-in-one deployed ingress controller and proxy are functional
    all_in_one_test.go:315: deploying an HTTP service to test the ingress controller and proxy
    all_in_one_test.go:321: exposing deployment httpbin via service
    all_in_one_test.go:326: creating an ingress for service httpbin with ingress.class kong
    all_in_one_test.go:335: finding the kong proxy service ip
    all_in_one_test.go:359: waiting for route from Ingress to be operational at http://172.18.0.2:30620/httpbin
--- PASS: TestDeployAllInOneDBLESSNoLoadBalancer (182.86s)

@rainest rainest force-pushed the fix/status-noblock branch 2 times, most recently from d6e0faa to 57bfa5b on November 19, 2021 23:02
@rainest rainest marked this pull request as ready for review November 19, 2021 23:08
@rainest rainest requested a review from mflendrich November 19, 2021 23:08
Travis Raines added 2 commits November 19, 2021 15:09
The initial status update loop setup function,
ctrlutils.PullConfigUpdate, sets up a channel select that receives
ConfigDone events. The config update loop sends to this channel, and
blocks if nothing can receive from it.

Previously, a failure to build the client configuration necessary to
update status would exit PullConfigUpdate. Failure to retrieve the
publish service address(es) would block PullConfigUpdate before it began
receiving events, and environments that never receive addresses
(LoadBalancer Services in clusters that cannot provision LoadBalancers)
would block indefinitely. If either of these occurred, the config update
loop would run once, block, and never run again.

This change avoids this deadlock by having PullConfigUpdate always begin
its channel receive. The ConfigDone receiver attempts to initialize
configuration and status information if they are not marked ready, and
marks them ready if it succeeds. If both are ready, it updates status.
If one or the other is not ready, it logs a debug-level message and does
nothing.
Add a test that uses a cluster without a load balancer.

Add support for overriding the KIC image in E2E tests. Loads an
arbitrary image into the E2E test cluster and uses kustomize to replace
that image in test manifests.

Add support for NodePorts and no-IP LoadBalancers (via their NodePort)
when attempting to verify an Ingress.
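For illustration, here is a minimal Go sketch of the LB-or-NodePort fallback this commit message describes. The proxyURL helper, the port-80 proxy listener, and the "proxy" port name are assumptions made for the sketch, not the test suite's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// proxyURL prefers a provisioned LoadBalancer IP and falls back to a node
// address plus the Service's NodePort when no LB IP ever appears.
func proxyURL(svc *corev1.Service, node *corev1.Node) (string, error) {
	for _, ing := range svc.Status.LoadBalancer.Ingress {
		if ing.IP != "" {
			return fmt.Sprintf("http://%s:80", ing.IP), nil // port 80 proxy listener is an assumption
		}
	}
	var nodeIP string
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			nodeIP = addr.Address
			break
		}
	}
	if nodeIP == "" {
		return "", fmt.Errorf("node %s has no internal IP", node.Name)
	}
	for _, port := range svc.Spec.Ports {
		if port.Name == "proxy" && port.NodePort != 0 { // the "proxy" port name is an assumption
			return fmt.Sprintf("http://%s:%d", nodeIP, port.NodePort), nil
		}
	}
	return "", fmt.Errorf("service %s exposes no proxy NodePort", svc.Name)
}

func main() {
	// A real test would pass the kong-proxy Service and any cluster node
	// fetched via client-go; omitted to keep the sketch focused on selection.
	fmt.Println("proxyURL picks an LB IP when present, otherwise nodeIP:NodePort")
}
```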
@rainest (Contributor, Author) commented Nov 19, 2021

Updated with the actual KTF release and rebased onto main, with the previous smaller changes squashed. It's now just the commit that fixes status handling plus the new E2E test.

The E2E tests normally only run on tags, with the expectation that the manifests in that tag will use an image built for that tag. That's true if you're doing a release (and it's what we want for releases, to test the actual artifacts), but not otherwise. To handle this, the test updates add a shim that checks whether you've specified an override via the environment, rewrites the stock manifest to use that image, loads the image, and deploys the modified manifest.
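A sketch of roughly how such a shim can work, using the TEST_KONG_CONTROLLER_IMAGE_OVERRIDE variable shown in the earlier manual run and kubectl's built-in kustomize rendering. The helper name and the kong/kubernetes-ingress-controller image name are assumptions; the real test code differs in detail:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// renderWithImageOverride returns the manifest to deploy: the stock one when no
// override is set, otherwise a copy patched to use the override image.
func renderWithImageOverride(manifestPath string) (string, error) {
	override := os.Getenv("TEST_KONG_CONTROLLER_IMAGE_OVERRIDE")
	if override == "" {
		return manifestPath, nil // no override: deploy the stock manifest as-is
	}
	name, tag := override, "latest"
	if i := strings.LastIndex(override, ":"); i > 0 {
		name, tag = override[:i], override[i+1:]
	}

	dir, err := os.MkdirTemp("", "kictest.")
	if err != nil {
		return "", err
	}
	// Copy the stock manifest next to the kustomization so kustomize can load it.
	manifest, err := os.ReadFile(manifestPath)
	if err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "manifest.yaml"), manifest, 0o600); err != nil {
		return "", err
	}
	// Assumed image name; the stock manifests use the upstream controller image.
	kustomization := fmt.Sprintf(`resources:
- manifest.yaml
images:
- name: kong/kubernetes-ingress-controller
  newName: %s
  newTag: %q
`, name, tag)
	if err := os.WriteFile(filepath.Join(dir, "kustomization.yaml"), []byte(kustomization), 0o600); err != nil {
		return "", err
	}

	// Render the patched manifest; the test then loads the override image into
	// the KIND cluster and applies this file instead of the original.
	out, err := exec.Command("kubectl", "kustomize", dir).Output()
	if err != nil {
		return "", err
	}
	patched := filepath.Join(dir, "manifest-patched.yaml")
	return patched, os.WriteFile(patched, out, 0o600)
}

func main() {
	path, err := renderWithImageOverride("../../deploy/single/all-in-one-dbless.yaml")
	if err != nil {
		panic(err)
	}
	fmt.Println("deploying manifest:", path)
}
```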

Creating the image remains up to the user: this does not add CI logic for building a test image from the current branch and overriding to that. You need to manually build images before you manually run E2E tests with an override. CI will eventually test the changes whenever they're tagged. test.txt is the manual test I ran for this branch. Though not shown, TestDeployAllInOneDBLESSNoLoadBalancer will fail when run with any current released image.

Travis Raines and others added 2 commits November 22, 2021 08:57
Co-authored-by: Michał Flendrich <michal@flendrich.pro>
@rainest (Contributor, Author) commented Nov 22, 2021

Re-ran following the changes. The Istio result is garbage since I don't run an SSH agent; ignoring it (tests.txt). I'm re-running that separately, but I don't expect it to have issues since it's a separate codebase from the all-in-one tests--still in e2e, but it doesn't use the deployKong() helper.

Edit: what a noisy test (istio.txt).

@rainest (Contributor, Author) commented Nov 22, 2021

The integration tests have failed twice now. The test introduced in 12099e8#diff-23f82f96580f3a9696df4886f9539c8b88cf3a3b2297b71c197e417f9d24e7b3 appears to be flaky and unrelated to this change:

  • The test is new.
  • The tested component (the admission server) behaves identically for both Postgres and DB-less systems. It failed once on DB-less (which prevented Postgres from running) and succeeded once on DB-less, but then failed in Postgres on that run.
  • The failure is due to a missing resource in the Kubernetes API store. I suspect we just need to wait; it's possible there's some lag between the credential creation succeeding and the credential actually being available (a retry along the lines of the sketch below).
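For reference, a sketch of what the "just wait" approach can look like using testify's Eventually; the getCredential helper and the test itself are hypothetical, not the actual fix that landed in #2029:

```go
package admission_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

// getCredential stands in for fetching the just-created credential via the
// Kubernetes API; it is hypothetical, not a helper in this repository.
func getCredential(ctx context.Context, name string) (bool, error) {
	return true, nil
}

func TestCredentialEventuallyVisible(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Poll instead of asserting immediately, to absorb any lag between the
	// create call returning and the object showing up in the API store.
	assert.Eventually(t, func() bool {
		found, err := getCredential(ctx, "test-credential")
		return err == nil && found
	}, time.Minute, time.Second)
}
```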

#2029 should deal with this, though it's a bit hard to prove. It's run in CI twice without reproducing the failure observed here.

@rainest rainest enabled auto-merge (squash) December 1, 2021 16:19
@rainest rainest merged commit 9b33faa into main Dec 1, 2021
@rainest rainest deleted the fix/status-noblock branch December 1, 2021 16:32
rainest pushed a commit that referenced this pull request Jan 18, 2022
* fix(status) do not block config updates on failure

The initial status update loop setup function,
ctrlutils.PullConfigUpdate, sets up a channel select that receives
ConfigDone events. The config update loop sends to this channel, and
blocks if nothing can receive from it.

Previously, a failure to build the client configuration necessary to
update status would exit PullConfigUpdate. Failure to retrieve the
publish service address(es) would block PullConfigUpdate before it began
receiving events, and environments that never receive addresses
(LoadBalancer Services in clusters that cannot provision LoadBalancers)
would block indefinitely. If either of these occurred, the config update
loop would run once, block, and never run again.

This change avoids this deadlock by having PullConfigUpdate always begin
its channel receive. The ConfigDone receiver attempts to initialize
configuration and status information if they are not marked ready, and
marks them ready if it succeeds. If both are ready, it updates status.
If one or the other is not ready, it logs a debug-level message and does
nothing.

* test(e2e) verify clusters without LB providers

Add a test that uses a cluster without a load balancer.

Add support for overriding the KIC image in E2E tests. Loads an
arbitrary image into the E2E test cluster and uses kustomize to replace
that image in test manifests.

Add support for NodePorts and no-IP LoadBalancers (via their NodePort)
when attempting to verify an Ingress.

Co-authored-by: Michał Flendrich <michal@flendrich.pro>

Successfully merging this pull request may close these issues.

Never-provisioned LoadBalancer blocks controller operation