
Add a minimum cluster stability timeout for OpenShift e2e remediation #12184

Merged 1 commit into ComplianceAsCode:master on Jul 19, 2024

Conversation

rhmdnd
Collaborator

@rhmdnd rhmdnd commented Jul 18, 2024

The OpenShift content has a manual remediation for setting up an
identity provider, which includes creating a secret, updating the
authentication configuration, and bouncing the authentication operator.

The test suite needs to make sure the authentication operator is up and
ready before it continues, and we recently updated the logic to make
sure it was ready by using the `oc adm wait-for-stable-cluster` command.
This command is ideal because it checks all cluster operators are
running and stable, not just the authentication operator.

One side-effect of using this command, though, is that it polls on
successful cluster conditions. We had assumed the command would exit
gracefully once the cluster was stable, which isn't the case.

However, we can pass an argument to the command to force an exit after
the cluster has been stable for a certain period of time. This commit
updates the command accordingly so that it doesn't hang and cause
timeouts in our testing.
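
For reference, a minimal sketch of how such a bounded wait might look (the flag comes from `oc adm wait-for-stable-cluster --help`; the five-minute value is an illustrative assumption, not necessarily what the content uses):

    # Exit once all cluster operators have been healthy for the given period,
    # rather than polling successful conditions indefinitely.
    oc adm wait-for-stable-cluster --minimum-stable-period=5m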

@rhmdnd rhmdnd added the OpenShift (OpenShift product related) label on Jul 18, 2024
@rhmdnd rhmdnd requested review from yuumasato and Vincent056 July 18, 2024 14:37

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod

@rhmdnd
Collaborator Author

rhmdnd commented Jul 18, 2024

/test


openshift-ci bot commented Jul 18, 2024

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.13-e2e-aws-ocp4-bsi
  • /test 4.13-e2e-aws-ocp4-bsi-node
  • /test 4.13-e2e-aws-ocp4-cis
  • /test 4.13-e2e-aws-ocp4-cis-node
  • /test 4.13-e2e-aws-ocp4-e8
  • /test 4.13-e2e-aws-ocp4-high
  • /test 4.13-e2e-aws-ocp4-high-node
  • /test 4.13-e2e-aws-ocp4-moderate
  • /test 4.13-e2e-aws-ocp4-moderate-node
  • /test 4.13-e2e-aws-ocp4-pci-dss
  • /test 4.13-e2e-aws-ocp4-pci-dss-node
  • /test 4.13-e2e-aws-ocp4-stig
  • /test 4.13-e2e-aws-ocp4-stig-node
  • /test 4.13-e2e-aws-rhcos4-bsi
  • /test 4.13-e2e-aws-rhcos4-e8
  • /test 4.13-e2e-aws-rhcos4-high
  • /test 4.13-e2e-aws-rhcos4-moderate
  • /test 4.13-e2e-aws-rhcos4-stig
  • /test 4.13-images
  • /test 4.14-e2e-aws-ocp4-bsi
  • /test 4.14-e2e-aws-ocp4-bsi-node
  • /test 4.14-e2e-aws-rhcos4-bsi
  • /test 4.14-images
  • /test 4.15-e2e-aws-ocp4-bsi
  • /test 4.15-e2e-aws-ocp4-bsi-node
  • /test 4.15-e2e-aws-ocp4-cis
  • /test 4.15-e2e-aws-ocp4-cis-node
  • /test 4.15-e2e-aws-ocp4-e8
  • /test 4.15-e2e-aws-ocp4-high
  • /test 4.15-e2e-aws-ocp4-high-node
  • /test 4.15-e2e-aws-ocp4-moderate
  • /test 4.15-e2e-aws-ocp4-moderate-node
  • /test 4.15-e2e-aws-ocp4-pci-dss
  • /test 4.15-e2e-aws-ocp4-pci-dss-node
  • /test 4.15-e2e-aws-ocp4-stig
  • /test 4.15-e2e-aws-ocp4-stig-node
  • /test 4.15-e2e-aws-rhcos4-bsi
  • /test 4.15-e2e-aws-rhcos4-e8
  • /test 4.15-e2e-aws-rhcos4-high
  • /test 4.15-e2e-aws-rhcos4-moderate
  • /test 4.15-e2e-aws-rhcos4-stig
  • /test 4.15-e2e-rosa-ocp4-cis-node
  • /test 4.15-e2e-rosa-ocp4-pci-dss-node
  • /test 4.15-images
  • /test 4.16-e2e-aws-ocp4-bsi
  • /test 4.16-e2e-aws-ocp4-bsi-node
  • /test 4.16-e2e-aws-ocp4-cis
  • /test 4.16-e2e-aws-ocp4-cis-node
  • /test 4.16-e2e-aws-ocp4-e8
  • /test 4.16-e2e-aws-ocp4-high
  • /test 4.16-e2e-aws-ocp4-high-node
  • /test 4.16-e2e-aws-ocp4-moderate
  • /test 4.16-e2e-aws-ocp4-moderate-node
  • /test 4.16-e2e-aws-ocp4-pci-dss
  • /test 4.16-e2e-aws-ocp4-pci-dss-node
  • /test 4.16-e2e-aws-ocp4-stig
  • /test 4.16-e2e-aws-ocp4-stig-node
  • /test 4.16-e2e-aws-rhcos4-bsi
  • /test 4.16-e2e-aws-rhcos4-e8
  • /test 4.16-e2e-aws-rhcos4-high
  • /test 4.16-e2e-aws-rhcos4-moderate
  • /test 4.16-e2e-aws-rhcos4-stig
  • /test 4.16-images
  • /test e2e-aws-ocp4-bsi
  • /test e2e-aws-ocp4-bsi-node
  • /test e2e-aws-ocp4-cis
  • /test e2e-aws-ocp4-cis-node
  • /test e2e-aws-ocp4-e8
  • /test e2e-aws-ocp4-high
  • /test e2e-aws-ocp4-high-node
  • /test e2e-aws-ocp4-moderate
  • /test e2e-aws-ocp4-moderate-node
  • /test e2e-aws-ocp4-pci-dss
  • /test e2e-aws-ocp4-pci-dss-node
  • /test e2e-aws-ocp4-stig
  • /test e2e-aws-ocp4-stig-node
  • /test e2e-aws-rhcos4-bsi
  • /test e2e-aws-rhcos4-e8
  • /test e2e-aws-rhcos4-high
  • /test e2e-aws-rhcos4-moderate
  • /test e2e-aws-rhcos4-stig
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-ComplianceAsCode-content-master-4.13-images
  • pull-ci-ComplianceAsCode-content-master-4.14-images
  • pull-ci-ComplianceAsCode-content-master-4.15-images
  • pull-ci-ComplianceAsCode-content-master-4.16-images
  • pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rhmdnd
Copy link
Collaborator Author

rhmdnd commented Jul 18, 2024

/test 4.13-e2e-aws-ocp4-cis
/test 4.15-e2e-aws-ocp4-cis
/test 4.16-e2e-aws-ocp4-cis
/test e2e-aws-ocp4-cis


🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12184
This image was built from commit: 7e3abc1

How to deploy it:

If you already have the Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12184

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12184 make deploy-local


codeclimate bot commented Jul 18, 2024

Code Climate has analyzed commit 7e3abc1 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.

@rhmdnd rhmdnd added this to the 0.1.74 milestone Jul 18, 2024
Member

@yuumasato yuumasato left a comment


/lgtm

Thank you for the fix

@yuumasato yuumasato merged commit 9c9ec43 into ComplianceAsCode:master Jul 19, 2024
97 of 99 checks passed
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

   === RUN   TestE2e/Apply_manual_remediations
    <snip>
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
    helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

  ComplianceAsCode#12120
  ComplianceAsCode#12184

Because we updated the remediation to use `oc adm
wait-for-stable-cluster`, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create a
ConfigMap called `trusted-ca-bundle` before updating the trusted CA.
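
For context, a minimal sketch of the shape of such a remediation script, under stated assumptions: the ConfigMap name `trusted-ca-bundle` comes from the commit message above, while the certificate subject, temporary file paths, the `openshift-config` namespace, and the `proxy/cluster` patch are illustrative guesses rather than the actual remediation:

    #!/usr/bin/env bash
    set -euo pipefail

    # Generate a throwaway self-signed certificate purely for testing
    # (subject, key size, and paths are assumptions).
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
      -subj "/CN=e2e-test-ca" \
      -keyout /tmp/e2e-ca.key -out /tmp/e2e-ca.crt

    # Create the ConfigMap the trusted CA remediation expects to find
    # (assumed to live in openshift-config under the ca-bundle.crt key).
    oc create configmap trusted-ca-bundle \
      --from-file=ca-bundle.crt=/tmp/e2e-ca.crt \
      -n openshift-config --dry-run=client -o yaml | oc apply -f -

    # Point the cluster-wide proxy at the new bundle so operators pick it up.
    oc patch proxy/cluster --type=merge \
      -p '{"spec":{"trustedCA":{"name":"trusted-ca-bundle"}}}'
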
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 30, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 31, 2024
vojtapolasek pushed a commit to vojtapolasek/content that referenced this pull request Aug 1, 2024