Add a minimum cluster stability timeout for OpenShift e2e remediation #12184
Conversation
The OpenShift content has a manual remediation for setting up an identity provider, which includes creating a secret, updating the authentication configuration, and bouncing the authentication operator. The test suite needs to make sure the authentication operator is up and ready before it continues, and we recently updated that logic to use the `oc adm wait-for-stable-cluster` command. This command is ideal because it checks that all cluster operators are running and stable, not just the authentication operator. One side-effect of using this command, though, is that it polls on successful cluster conditions. We were using it under the assumption that the command would exit gracefully once the cluster was stable, which isn't the case. However, we can pass an argument to the command to force an exit after the cluster has been stable for a certain period of time. This commit updates the command to do that so that it doesn't hang and cause timeouts in our testing.
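For reference, a minimal sketch of the remediation steps described above, using the standard htpasswd-based identity provider flow; the user name, password, file names, and provider name are placeholders and not taken from the actual e2e-remediation.sh:

```bash
# 1. Create a secret holding an htpasswd file in openshift-config
#    (test user and password are illustrative)
htpasswd -c -B -b users.htpasswd testuser testpassword
oc create secret generic htpass-secret \
  --from-file=htpasswd=users.htpasswd -n openshift-config

# 2. Point the cluster OAuth configuration at the new identity provider
oc apply -f - <<EOF
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: test-htpasswd
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
EOF

# 3. Wait for the authentication operator (and all other cluster operators)
#    to settle before the test suite continues
oc adm wait-for-stable-cluster
```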
/test
/test 4.13-e2e-aws-ocp4-cis
🤖 A k8s content image for this PR is available at: Click here to see how to deploy it. If you already have Compliance Operator deployed: Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
Code Climate has analyzed commit 7e3abc1 and detected 0 issues on this pull request. The test coverage on the diff in this pull request is 100.0% (50% is the threshold). This pull request will bring the total coverage in the repository to 59.4% (0.0% change). View more on Code Climate.
/lgtm
Thank you for the fix
Lately, we've been experiencing issues with manual remediations timing out during functional testing. This manifests in the following error:

=== RUN   TestE2e/Apply_manual_remediations
<snip>
helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an Identity Provider to the cluster failed, but this is actually an unintended side-effect of another change that updated the idp_is_configured remediation to use a more robust technique for determining whether the cluster applied the remediation successfully: ComplianceAsCode#12120 ComplianceAsCode#12184

Because we updated the remediation to use `oc adm wait-for-stable-cluster`, we're effectively checking all cluster operators to ensure they're healthy. This started causing timeouts because a separate, unrelated remediation was also getting applied in our testing that updated the default CA but didn't include a ConfigMap containing the CA bundle. As a result, one of the operators didn't come up because it was looking for a ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster` command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA remediation to generate a certificate for testing purposes and create a ConfigMap called `trusted-ca-bundle` before updating the trusted CA.
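A minimal sketch of that fix, assuming standard `openssl` and `oc` usage; only the `trusted-ca-bundle` name comes from the description above, while the file names and certificate subject are placeholders:

```bash
# Generate a throwaway certificate purely for testing purposes
openssl req -x509 -newkey rsa:4096 -nodes -days 1 \
  -keyout ca.key -out ca-bundle.crt -subj "/CN=e2e-test-ca"

# Publish the bundle in the ConfigMap the cluster operators look for,
# before the trusted CA itself is updated
oc create configmap trusted-ca-bundle \
  --from-file=ca-bundle.crt=ca-bundle.crt \
  -n openshift-config
```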
The OpenShift content has a manual remediation for setting up an identity provider, which includes creating a secret, updating the authentication configuration, and bouncing the authentication operator. The test suite needs to make sure the authentication operator is up and ready before it continues, and we recently updated that logic to use the `oc adm wait-for-stable-cluster` command. This command is ideal because it checks that all cluster operators are running and stable, not just the authentication operator.

One side-effect of using this command, though, is that it polls on successful cluster conditions. We were using it under the assumption that the command would exit gracefully once the cluster was stable, which isn't the case.

However, we can pass an argument to the command to force an exit after the cluster has been stable for a certain period of time. This commit updates the command to do that so that the command doesn't hang and cause timeouts in our testing.
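A hedged sketch of what the adjusted call can look like; `--minimum-stable-period` is the kind of argument referred to above, and the durations shown are illustrative rather than the exact values used in the commit:

```bash
# Exit once every cluster operator has been stable for a minimum window,
# instead of polling indefinitely, and give up after an overall timeout
oc adm wait-for-stable-cluster --minimum-stable-period=1m --timeout=20m
```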