CI tests failing for flux integration #5827
Updating the CI config to be able to view the logs of the failed flux CI test (#5841) shows:
The NotFound error appears to be returned by the shared code in kubeapps/cmd/kubeapps-apis/plugins/pkg/resourcerefs/resourcerefs.go (lines 111 to 113 at fa761a0), which implies that the helm command that was run to get the helm release manifest for … failed because the release was not found.
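As a sanity check (a sketch only; the release name and namespace here are hypothetical), the same release can be queried directly with the helm CLI, and if helm itself can't find the release, the plugin's NotFound is accurate:

```bash
# Query the release that the resourcerefs code is trying to read.
# "my-podinfo" and "flux-test" are placeholder values for this sketch.
helm status my-podinfo -n flux-test

# If the release exists, this prints the rendered manifest that
# resourcerefs.go parses for resource references:
helm get manifest my-podinfo -n flux-test
```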
More debugging via logs shows:
So at this point I can only assume that the error is indeed correct: the helm release does not exist, so I need to check the flux controllers and their logs to see why. Trying to reproduce locally, I notice that six flux controllers are installed, when I think we only need two; scaling the extras down could save us some CPU usage:
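For reference, something like the following lists the installed controllers (assuming the default flux-system namespace; a full flux install typically ships six controllers, of which only source-controller and helm-controller are needed to reconcile HelmReleases):

```bash
# List the controller deployments of a default flux installation.
# Typically: source-controller, helm-controller, kustomize-controller,
# notification-controller, image-reflector-controller and
# image-automation-controller.
kubectl -n flux-system get deployments
```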
I can scale three of those down to zero and still deploy the flux helm release without issue:
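A sketch of that scaling step, assuming the three controllers not involved in HelmRelease reconciliation are the notification and image controllers:

```bash
# Scale down the controllers that HelmRelease reconciliation doesn't need.
kubectl -n flux-system scale deployment \
  notification-controller \
  image-reflector-controller \
  image-automation-controller \
  --replicas=0
```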
If I scale the helm-controller to zero, I reproduce the issue seen in CI both visually and in the logs, since the helm release is never created. EDIT: Actually, this is different from what CI sees visually, in that I see that the helm release is not created, but in my case it's because Kubeapps created the flux …
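The reproduction itself is a one-liner (again assuming the flux-system namespace):

```bash
# Stop the helm-controller so HelmRelease objects are created but never
# reconciled into actual helm releases.
kubectl -n flux-system scale deployment helm-controller --replicas=0

# The HelmRelease custom resource exists but stays unready:
kubectl get helmreleases --all-namespaces
```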
Woo - so printing the logs of flux's helm-controller reveals:
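For the record, retrieving those logs looks roughly like this (the actual log output isn't reproduced here):

```bash
# Print recent logs from flux's helm-controller deployment.
kubectl -n flux-system logs deploy/helm-controller --tail=100
```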
So in the end, it was an update of the podinfo chart, since we don't deploy a specific version. I didn't see this locally since we use 1.24 in the dev environment. To avoid the time required for investigation next time, I recommend we use a specific version of podinfo for the tests (so we control when it updates).
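A sketch of what pinning could look like if the test installs podinfo via a flux HelmRelease; the names and version below are illustrative, not the values the tests should necessarily use:

```bash
# Pin the chart version so an upstream podinfo release can't silently
# change what CI tests. Assumes a HelmRepository named "podinfo" exists.
kubectl apply -f - <<EOF
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  interval: 1m
  chart:
    spec:
      chart: podinfo
      version: "6.1.8"  # pinned: bump deliberately, don't float
      sourceRef:
        kind: HelmRepository
        name: podinfo
EOF
```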
Signed-off-by: Michael Nelson <minelson@vmware.com>

### Description of the change

The main update is to switch to k8s 1.24.7 in our CI (and dev) environments, because CI was running with 1.22 and the podinfo chart that the flux tests use is no longer compatible with 1.22. As a result, we also needed to create specific service token secrets and update the way CI gets those. There's also some extra logging to help trace issues in the future.

### Benefits

CI passes again.

### Applicable issues

- fixes #5827
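For context on the service token secrets: from Kubernetes 1.24, a ServiceAccount no longer gets a long-lived token Secret created automatically, so CI has to create one explicitly. A minimal sketch with illustrative names:

```bash
# Create a long-lived token Secret for an existing ServiceAccount
# ("kubeapps-ci" is a placeholder); the token controller fills in the token.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: kubeapps-ci-token
  namespace: default
  annotations:
    kubernetes.io/service-account.name: kubeapps-ci
type: kubernetes.io/service-account-token
EOF

# Extract the generated token for use in CI:
kubectl get secret kubeapps-ci-token -o jsonpath='{.data.token}' | base64 -d
```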
Summary
A few commits ago, the CI tests for our main branch began failing. The last successful CI test of main was when #5798 landed, syncing the upstream chart changes for the 12.1.3 chart release.
At the time of writing, there are only three commits on main since, all of which failed once landed (though each PR passed before landing).
The flux test fails while waiting for the deployed package to show pods (for the pie chart to be populated):
but the screenshot shows that there's nothing available by the timeout:
(creating the issue to document what's known right now as I won't get to this before the break).