CI tests failing for flux integration #5827

Closed
absoludity opened this issue Dec 22, 2022 · 3 comments · Fixed by #5841
Labels: component/ci (Issue related to kubeapps ci system), kind/bug (An issue that reports a defect in an existing feature)

Comments

@absoludity
Contributor

absoludity commented Dec 22, 2022

Summary
A few commits ago, the CI tests for our main branch began failing. The last successful CI test of main was when #5798 landed, syncing the upstream chart changes for the 12.1.3 chart release.

At the time of writing, there are only three commits on main since, all of which failed once landed (though each PR passed before landing).

The flux test fails while waiting for the deployed package to show pods (for the pie chart to be populated):

  1) [chromium] › tests/flux/04-default-deployment.spec.js:8:1 › Deploys podinfo package with default values in main cluster 

    page.waitForSelector: Timeout 180000ms exceeded.
    =========================== logs ===========================
    waiting for locator('.application-status-pie-chart-number').locator('text=1') to be visible
    ============================================================

      34 |
      35 |   // Assertions
    > 36 |   await page.waitForSelector("css=.application-status-pie-chart-number >> text=1", {
         |              ^
      37 |     timeout: utils.getDeploymentTimeout(),
      38 |   });
      39 |   await page.waitForSelector("css=.application-status-pie-chart-title >> text=Ready", {

        at /app/tests/flux/04-default-deployment.spec.js:36:14

but the screenshot shows that there's nothing available by the timeout:

(screenshot: test-failed-1)

(creating the issue to document what's known right now as I won't get to this before the break).

@kubeapps-bot kubeapps-bot moved this to 🗂 Backlog in Kubeapps Dec 22, 2022
@absoludity absoludity moved this from 🗂 Backlog to 🗒 Todo in Kubeapps Dec 22, 2022
@ppbaena ppbaena added kind/bug An issue that reports a defect in an existing feature component/ci Issue related to kubeapps ci system labels Dec 22, 2022
@ppbaena ppbaena added this to the Technical debt milestone Dec 22, 2022
@absoludity absoludity self-assigned this Jan 9, 2023
@absoludity
Contributor Author

Updating CI so that the logs of the failed flux CI test can be viewed (#5841) shows:

I0109 04:40:43.864912       1 server.go:62] OK 101.857139ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetAvailablePackageDetail
I0109 04:40:45.771535       1 packages.go:245] "+core GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" identifier="test-04-release-46045"
I0109 04:40:45.771827       1 server.go:490] "+fluxv2 GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" id="test-04-release-46045"
I0109 04:40:45.797575       1 server.go:62] NotFound 26.050191ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs
I0109 04:40:47.808058       1 packages.go:245] "+core GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" identifier="test-04-release-46045"
I0109 04:40:47.808098       1 server.go:490] "+fluxv2 GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" id="test-04-release-46045"
I0109 04:40:47.844990       1 server.go:62] NotFound 36.940972ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs
I0109 04:40:53.938159       1 packages.go:245] "+core GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" identifier="test-04-release-46045"
I0109 04:40:53.938217       1 server.go:490] "+fluxv2 GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" id="test-04-release-46045"
I0109 04:40:53.964418       1 server.go:62] NotFound 26.270317ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs
I0109 04:40:58.016700       1 packages.go:245] "+core GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" identifier="test-04-release-46045"
I0109 04:40:58.016767       1 server.go:490] "+fluxv2 GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" id="test-04-release-46045"
I0109 04:40:58.039655       1 server.go:62] NotFound 22.971239ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs
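
When a log like this gets long, a quick tally of status per RPC makes the repeated NotFound stand out. The following is only a rough sketch: the regular expression is inferred from the sample lines above rather than from any documented log format, and `tallyResponses` is a hypothetical helper, not part of Kubeapps.

```javascript
// A few lines copied from the kubeapps-apis log above.
const logLines = [
  'I0109 04:40:43.864912       1 server.go:62] OK 101.857139ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetAvailablePackageDetail',
  'I0109 04:40:45.771535       1 packages.go:245] "+core GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" identifier="test-04-release-46045"',
  'I0109 04:40:45.797575       1 server.go:62] NotFound 26.050191ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs',
  'I0109 04:40:47.844990       1 server.go:62] NotFound 36.940972ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs',
];

// Matches the "<status> <duration>ms <rpc path>" summary lines.
// Assumption: this shape is inferred from the sample above.
const SUMMARY = /\] (\w+) ([\d.]+)ms (\/\S+)$/;

// Hypothetical helper: counts responses keyed by "<status> <method name>".
function tallyResponses(lines) {
  const counts = {};
  for (const line of lines) {
    const m = SUMMARY.exec(line);
    if (!m) continue; // skip request-start lines such as "+core ..."
    const method = m[3].split("/").pop();
    const key = `${m[1]} ${method}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

console.log(tallyResponses(logLines));
```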

The NotFound error appears to be returned by the shared resourcerefs library:

if err == driver.ErrReleaseNotFound {
    return nil, status.Errorf(codes.NotFound, "Unable to find Helm release %q in namespace %q: %+v", helmReleaseName, namespace, err)
}

which implies that the helm command that was run to get the helm release manifest for podinfo in the kubeapps-user-namespace itself returned ErrReleaseNotFound.

@absoludity
Contributor Author

absoludity commented Jan 9, 2023

More debugging via logs shows:

I0109 06:21:45.930588       1 resourcerefs.go:96] "+resourcerefs GetInstalledPackageResourceRefs" helmReleaseName="kubeapps-user-namespace/test-04-release-73806"
E0109 06:21:45.932943       1 resourcerefs.go:115] "resourcerefs GetInstalledPackageResourceRefs" err="release: not found"

so at this point I can only assume that the error is indeed correct: the helm release does not exist. So I need to check the flux controllers and logs to see why.

Trying to reproduce locally, I notice that six flux controllers are installed, when I think we only need 2. Scaling the rest down could save us some CPU usage:

NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
helm-controller               0/1     1            0           33s
image-automation-controller   0/1     1            0           33s
image-reflector-controller    0/1     1            0           33s
kustomize-controller          0/1     1            0           33s
notification-controller       0/1     1            0           33s
source-controller             0/1     1            0           33s

I can scale three of those down to zero and still deploy the flux helm release without issue:

k -n flux-system get deployments                                      
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
helm-controller               1/1     1            1           17m
image-automation-controller   0/0     0            0           17m
image-reflector-controller    0/0     0            0           17m
kustomize-controller          0/0     0            0           17m
notification-controller       1/1     1            1           17m
source-controller             1/1     1            1           17m

If I scale the helm-controller to zero, I reproduce the issue seen in CI, both visually and in the logs:

I0109 23:09:36.737521       1 server.go:490] "+fluxv2 GetInstalledPackageResourceRefs" cluster="default" namespace="kubeapps-user-namespace" id="test-6"
I0109 23:09:36.757668       1 server.go:62] NotFound 20.196643ms /kubeappsapis.core.packages.v1alpha1.PackagesService/GetInstalledPackageResourceRefs

since the helm release is never created. EDIT: Actually, this differs from what CI shows. In both cases the helm release is not created, but locally that is because Kubeapps created the flux HelmRelease and it is never updated (as the controller isn't running), whereas in CI, as per the image above, the HelmRelease is updated to pending but never moves out of pending. So the next step is to get the info about the HelmRelease.

So, my suspicion is that the Helm controller is erroring or running short of resources (again), and one simple way to remedy this may be to scale down the unnecessary deployments as above.

@absoludity
Contributor Author

absoludity commented Jan 10, 2023

Woo - so printing the logs of flux's helm-controller reveals:

Helm install failed: chart requires kubeVersion: >=1.23.0-0 which is incompatible with Kubernetes v1.22.15

{"level":"error","ts":"2023-01-10T03:04:14.945Z","msg":"Reconciler error","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"test-04-release-36547","namespace":"kubeapps-user-namespace"},"namespace":"kubeapps-user-namespace","name":"test-04-release-36547","reconcileID":"45d5758e-2170-45e7-b5e7-70d800d95da8","error":"Helm install failed: chart requires kubeVersion: >=1.23.0-0 which is incompatible with Kubernetes v1.22.15"}
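
Since the helm-controller emits structured JSON, the failing release and its error can be pulled out mechanically rather than by eye. This is a minimal sketch assuming each log line is one JSON object with the `level`, `namespace`, `name` and `error` fields seen in the line above; `reconcilerErrors` is a hypothetical helper, not part of flux.

```javascript
// The helm-controller log line from above, copied verbatim.
const helmControllerLine = '{"level":"error","ts":"2023-01-10T03:04:14.945Z","msg":"Reconciler error","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"test-04-release-36547","namespace":"kubeapps-user-namespace"},"namespace":"kubeapps-user-namespace","name":"test-04-release-36547","reconcileID":"45d5758e-2170-45e7-b5e7-70d800d95da8","error":"Helm install failed: chart requires kubeVersion: >=1.23.0-0 which is incompatible with Kubernetes v1.22.15"}';

// Hypothetical helper: returns one { release, error } entry per error-level
// JSON line. Field names are assumed from the sample line only.
function reconcilerErrors(lines) {
  const found = [];
  for (const raw of lines) {
    let entry;
    try {
      entry = JSON.parse(raw);
    } catch {
      continue; // ignore anything that is not JSON
    }
    if (!entry || entry.level !== "error" || !entry.error) continue;
    found.push({
      release: `${entry.namespace}/${entry.name}`,
      error: entry.error,
    });
  }
  return found;
}

console.log(reconcilerErrors([helmControllerLine]));
```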

So in the end, the cause was an update of the podinfo chart, which we pick up automatically because we don't deploy a specific version. I didn't see this locally since we use 1.24 in the dev environment. To avoid the time required for investigation next time, I recommend we use a specific version of podinfo for tests (so we control when it updates).
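
To make the mismatch concrete, here is a rough sketch of the comparison behind that error. It is not Helm's implementation: it only compares major, minor and patch numbers and ignores pre-release suffixes such as `-0`, which is enough for the versions in this issue. `parseVersion` and `meetsMinimum` are hypothetical names.

```javascript
// Parse "v1.22.15" or "1.23.0-0" into [major, minor, patch], dropping any
// leading "v" and any pre-release suffix (a simplifying assumption).
function parseVersion(v) {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(v);
  if (!m) throw new Error(`unrecognised version: ${v}`);
  return m.slice(1, 4).map(Number);
}

// True when `cluster` is at least `minimum`, comparing component by component.
function meetsMinimum(cluster, minimum) {
  const a = parseVersion(cluster);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true; // identical versions satisfy ">="
}

console.log(meetsMinimum("v1.22.15", "1.23.0-0")); // false: the CI cluster
console.log(meetsMinimum("v1.24.7", "1.23.0-0")); // true: a 1.24 cluster
```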

absoludity added a commit that referenced this issue Jan 10, 2023
Signed-off-by: Michael Nelson <minelson@vmware.com>


### Description of the change

The main update is to switch to k8s 1.24.7 in our CI (and dev)
environments, because CI was running with 1.22 and the podinfo chart
used by the flux tests is no longer compatible with 1.22.

As a result, we also needed to create specific service token secrets and
update the way CI gets those.

There's also some extra logging to help trace issues in the future.

### Benefits

CI passes again.

### Possible drawbacks


### Applicable issues


- fixes #5827 

### Additional information


Signed-off-by: Michael Nelson <minelson@vmware.com>
@github-project-automation github-project-automation bot moved this from 🗒 Todo to ✅ Done in Kubeapps Jan 10, 2023