
Add a minimum cluster stability timeout for OpenShift e2e remediation #12184

Merged 1 commit into ComplianceAsCode:master on Jul 19, 2024

Conversation

rhmdnd
Collaborator

@rhmdnd rhmdnd commented Jul 18, 2024

The OpenShift content has a manual remediation for setting up an
identity provider, which includes creating a secret, updating the
authentication configuration, and bouncing the authentication operator.

The test suite needs to make sure the authentication operator is up and
ready before it continues, and we recently updated the logic to make
sure it was ready by using the `oc adm wait-for-stable-cluster` command.
This command is ideal because it checks all cluster operators are
running and stable, not just the authentication operator.

One side-effect of using this command, though, is that it polls on
successful cluster conditions. We had assumed the command would exit
gracefully once the cluster was stable, which isn't the case.

However, we can pass an argument to the command to force an exit after
the cluster has been stable for a certain period of time. This commit
updates the command accordingly so that it doesn't hang and cause
timeouts in our testing.
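
For reference, a minimal sketch of how such a bounded wait might look (the flag comes from `oc adm wait-for-stable-cluster --help`; the five-minute value is an illustrative assumption, not necessarily what the content uses):

    # Exit once all cluster operators have been healthy for the given period,
    # rather than polling successful conditions indefinitely.
    oc adm wait-for-stable-cluster --minimum-stable-period=5m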

@rhmdnd rhmdnd added the OpenShift (OpenShift product related) label on Jul 18, 2024
@rhmdnd rhmdnd requested review from yuumasato and Vincent056 July 18, 2024 14:37

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod

@rhmdnd
Collaborator Author

rhmdnd commented Jul 18, 2024

/test


openshift-ci bot commented Jul 18, 2024

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.13-e2e-aws-ocp4-bsi
  • /test 4.13-e2e-aws-ocp4-bsi-node
  • /test 4.13-e2e-aws-ocp4-cis
  • /test 4.13-e2e-aws-ocp4-cis-node
  • /test 4.13-e2e-aws-ocp4-e8
  • /test 4.13-e2e-aws-ocp4-high
  • /test 4.13-e2e-aws-ocp4-high-node
  • /test 4.13-e2e-aws-ocp4-moderate
  • /test 4.13-e2e-aws-ocp4-moderate-node
  • /test 4.13-e2e-aws-ocp4-pci-dss
  • /test 4.13-e2e-aws-ocp4-pci-dss-node
  • /test 4.13-e2e-aws-ocp4-stig
  • /test 4.13-e2e-aws-ocp4-stig-node
  • /test 4.13-e2e-aws-rhcos4-bsi
  • /test 4.13-e2e-aws-rhcos4-e8
  • /test 4.13-e2e-aws-rhcos4-high
  • /test 4.13-e2e-aws-rhcos4-moderate
  • /test 4.13-e2e-aws-rhcos4-stig
  • /test 4.13-images
  • /test 4.14-e2e-aws-ocp4-bsi
  • /test 4.14-e2e-aws-ocp4-bsi-node
  • /test 4.14-e2e-aws-rhcos4-bsi
  • /test 4.14-images
  • /test 4.15-e2e-aws-ocp4-bsi
  • /test 4.15-e2e-aws-ocp4-bsi-node
  • /test 4.15-e2e-aws-ocp4-cis
  • /test 4.15-e2e-aws-ocp4-cis-node
  • /test 4.15-e2e-aws-ocp4-e8
  • /test 4.15-e2e-aws-ocp4-high
  • /test 4.15-e2e-aws-ocp4-high-node
  • /test 4.15-e2e-aws-ocp4-moderate
  • /test 4.15-e2e-aws-ocp4-moderate-node
  • /test 4.15-e2e-aws-ocp4-pci-dss
  • /test 4.15-e2e-aws-ocp4-pci-dss-node
  • /test 4.15-e2e-aws-ocp4-stig
  • /test 4.15-e2e-aws-ocp4-stig-node
  • /test 4.15-e2e-aws-rhcos4-bsi
  • /test 4.15-e2e-aws-rhcos4-e8
  • /test 4.15-e2e-aws-rhcos4-high
  • /test 4.15-e2e-aws-rhcos4-moderate
  • /test 4.15-e2e-aws-rhcos4-stig
  • /test 4.15-e2e-rosa-ocp4-cis-node
  • /test 4.15-e2e-rosa-ocp4-pci-dss-node
  • /test 4.15-images
  • /test 4.16-e2e-aws-ocp4-bsi
  • /test 4.16-e2e-aws-ocp4-bsi-node
  • /test 4.16-e2e-aws-ocp4-cis
  • /test 4.16-e2e-aws-ocp4-cis-node
  • /test 4.16-e2e-aws-ocp4-e8
  • /test 4.16-e2e-aws-ocp4-high
  • /test 4.16-e2e-aws-ocp4-high-node
  • /test 4.16-e2e-aws-ocp4-moderate
  • /test 4.16-e2e-aws-ocp4-moderate-node
  • /test 4.16-e2e-aws-ocp4-pci-dss
  • /test 4.16-e2e-aws-ocp4-pci-dss-node
  • /test 4.16-e2e-aws-ocp4-stig
  • /test 4.16-e2e-aws-ocp4-stig-node
  • /test 4.16-e2e-aws-rhcos4-bsi
  • /test 4.16-e2e-aws-rhcos4-e8
  • /test 4.16-e2e-aws-rhcos4-high
  • /test 4.16-e2e-aws-rhcos4-moderate
  • /test 4.16-e2e-aws-rhcos4-stig
  • /test 4.16-images
  • /test e2e-aws-ocp4-bsi
  • /test e2e-aws-ocp4-bsi-node
  • /test e2e-aws-ocp4-cis
  • /test e2e-aws-ocp4-cis-node
  • /test e2e-aws-ocp4-e8
  • /test e2e-aws-ocp4-high
  • /test e2e-aws-ocp4-high-node
  • /test e2e-aws-ocp4-moderate
  • /test e2e-aws-ocp4-moderate-node
  • /test e2e-aws-ocp4-pci-dss
  • /test e2e-aws-ocp4-pci-dss-node
  • /test e2e-aws-ocp4-stig
  • /test e2e-aws-ocp4-stig-node
  • /test e2e-aws-rhcos4-bsi
  • /test e2e-aws-rhcos4-e8
  • /test e2e-aws-rhcos4-high
  • /test e2e-aws-rhcos4-moderate
  • /test e2e-aws-rhcos4-stig
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-ComplianceAsCode-content-master-4.13-images
  • pull-ci-ComplianceAsCode-content-master-4.14-images
  • pull-ci-ComplianceAsCode-content-master-4.15-images
  • pull-ci-ComplianceAsCode-content-master-4.16-images
  • pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rhmdnd
Copy link
Collaborator Author

rhmdnd commented Jul 18, 2024

/test 4.13-e2e-aws-ocp4-cis
/test 4.15-e2e-aws-ocp4-cis
/test 4.16-e2e-aws-ocp4-cis
/test e2e-aws-ocp4-cis


🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12184
This image was built from commit: 7e3abc1

How to deploy it:

If you already have the Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12184

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12184 make deploy-local


codeclimate bot commented Jul 18, 2024

Code Climate has analyzed commit 7e3abc1 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.

@rhmdnd rhmdnd added this to the 0.1.74 milestone Jul 18, 2024
Member

@yuumasato yuumasato left a comment


/lgtm

Thank you for the fix

@yuumasato yuumasato merged commit 9c9ec43 into ComplianceAsCode:master Jul 19, 2024
97 of 99 checks passed
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

   === RUN   TestE2e/Apply_manual_remediations
    <snip>
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
    helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

  ComplianceAsCode#12120
  ComplianceAsCode#12184

Because we updated the remediation to use `oc adm
wait-for-stable-cluster`, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create a
ConfigMap called `trusted-ca-bundle` before updating the trusted CA.
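
For context, a minimal sketch of the shape of such a remediation script, under stated assumptions: the ConfigMap name `trusted-ca-bundle` comes from the commit message above, while the certificate subject, temporary file paths, the `openshift-config` namespace, and the `proxy/cluster` patch are illustrative guesses rather than the actual remediation:

    #!/usr/bin/env bash
    set -euo pipefail

    # Generate a throwaway self-signed certificate purely for testing
    # (subject, key size, and paths are assumptions).
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
      -subj "/CN=e2e-test-ca" \
      -keyout /tmp/e2e-ca.key -out /tmp/e2e-ca.crt

    # Create the ConfigMap the trusted CA remediation expects to find
    # (assumed to live in openshift-config under the ca-bundle.crt key).
    oc create configmap trusted-ca-bundle \
      --from-file=ca-bundle.crt=/tmp/e2e-ca.crt \
      -n openshift-config --dry-run=client -o yaml | oc apply -f -

    # Point the cluster-wide proxy at the new bundle so operators pick it up.
    oc patch proxy/cluster --type=merge \
      -p '{"spec":{"trustedCA":{"name":"trusted-ca-bundle"}}}'
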
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 30, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 31, 2024
vojtapolasek pushed a commit to vojtapolasek/content that referenced this pull request Aug 1, 2024