MCO: clear out failing status on success and add tests #442
Conversation
return c.innerConfigClient.ConfigV1().ClusterOperators().Get(name, options)
}

type ClusterOperatorsClientInterface interface {
This is important so we can control what we use in the MCO itself and create meaningful mocks for proper unit tests.
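To make that concrete, here is a minimal sketch of the pattern. The interface and the Get signature mirror the fragments visible in this diff; the fake's type name and the package wiring are my assumptions, not necessarily the PR's exact code.

package operator

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterOperatorsClientInterface is the narrow client surface the MCO
// actually consumes, so tests can substitute a double for the real
// client-go client.
type ClusterOperatorsClientInterface interface {
	Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error)
}

// fakeClusterOperatorsClient is a hypothetical test double that always
// returns a canned ClusterOperator, mirroring the test fragment below.
type fakeClusterOperatorsClient struct {
	co *configv1.ClusterOperator
}

func (coc *fakeClusterOperatorsClient) Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error) {
	return coc.co, nil
}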
pkg/operator/status_test.go (outdated diff)
return coc.co, nil
}

func TestOperatorFailingStatusClearsOut(t *testing.T) {
This test fails correctly without the patch that fixes the bug.
I'll also add tests for the other conditions to make sure we never regress on this again...
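For reference, the tests added here (visible in the diff further down) are table-driven over condition sets and sync functions. Below is a simplified stand-in for their shape; the real fn takes a renderConfig argument, dropped here so the sketch compiles on its own.

package operator

import configv1 "github.com/openshift/api/config/v1"

// Simplified stand-ins for the syncFunc/syncCase types used by the PR's
// table-driven tests in pkg/operator/status_test.go.
type syncFunc struct {
	name string
	fn   func() error
}

type syncCase struct {
	// cond holds the conditions expected after running syncFuncs.
	cond      []configv1.ClusterOperatorStatusCondition
	syncFuncs []syncFunc
}

// exampleCase encodes the regression this PR guards against: a successful
// sync must leave Failing=false.
var exampleCase = syncCase{
	cond: []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse},
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
		{Type: configv1.OperatorFailing, Status: configv1.ConditionFalse},
	},
	syncFuncs: []syncFunc{
		{name: "fn1", fn: func() error { return nil }},
	},
}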
The unit test failures are #417. /retest
Force-pushed from b88e650 to 2d6f764.
OK yeah, I was looking at this code recently to add some logging and noticed that we never seemed to set failing to false. /lgtm
I'm trying to cover the other cases in another PR.
Lots of tests failed in the last run https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws/1915 due to the apiserver not being up (?)
The MCO cluster operator looks fine though.
/retest Please review the full test history for this PR and help us cut down flakes.
Force-pushed from 2d6f764 to d034512.
Rebased to resolve merge conflicts.
Force-pushed from d034512 to fff01c3.
Unit flake tracked in #417. /retest
I'm debugging the latest failure in e2e-aws-op, where the machine-config-operator didn't even come up, along with many other pods missing. /retest
Oh, it looks like the worker nodes weren't all available https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583/artifacts/e2e-aws-op/nodes.json (3/6); I'm not sure how to debug this further to understand why 🤔
It looks like the failure @cgwalters reported on Slack (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/426/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/561): basically only the masters come up (but with a different failure mode).
Force-pushed from fff01c3 to ce883de.
Force-pushed from e21b9c7 to ed3747b.
This went green again, so I've pushed again to re-kick the tests. The e2e-aws-op runs are here: https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/
HAProxy flake, the rest is fine. /retest
Green again, retesting.
Force-pushed from d0e2a1a to 987d5de.
Green again; re-pushed to kick the tests again. Here's the status so far:
All green again. Since the last 8 e2e-aws and e2e-aws-op runs (so from when I pushed the final revision of this PR):
Fix a bug reported in openshift#406 (comment) where we weren't properly resetting the failing status. Signed-off-by: Antonio Murdaca <runcom@linux.com>
Specifically, add a test to verify that we're clearing out the failing condition on subsequent sync operations (covering a bug reported here openshift#406 (comment)) Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
One more time...
Force-pushed from 987d5de to 2a582a9.
/kind bug
HAProxy flake again. /retest
failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

available := configv1.ConditionTrue

if failing && !progressing {
Hm, so failing now always means !available? OK, that looks like what the CVO does as well, I think.
My understanding for the MCO is that if the sync fails while progressing, we could have e.g. applied a new MCO or something else that can misbehave after a failure, so there's no reason to report available. @abhinavdahiya what do you think?
So, looking at https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions, there's an example:
If an error blocks reaching 4.0.1, the conditions might be:
Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
Available is true with message Cluster has deployed 4.0.0
Progressing is true with message Unable to apply 4.0.1: a required object is missing
I'm not sure how to read that. If any of the syncFuncs here fails during a progressing, the status of the MCO isn't really available, because we may have e.g. started rolling a new MCC but the MCD broke, so it makes little sense to report Available.
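To make the quoted example concrete, here it is restated as a conditions slice. The messages are copied from the doc quoted above; the helper wrapper is just so the sketch compiles standalone.

package example

import configv1 "github.com/openshift/api/config/v1"

// docExampleConditions mirrors the CVO doc's "error blocks reaching 4.0.1"
// example: Failing and Progressing are true, yet Available stays true for
// the version already deployed.
func docExampleConditions() []configv1.ClusterOperatorStatusCondition {
	return []configv1.ClusterOperatorStatusCondition{
		{
			Type:    configv1.OperatorFailing,
			Status:  configv1.ConditionTrue,
			Message: "Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.",
		},
		{
			Type:    configv1.OperatorAvailable,
			Status:  configv1.ConditionTrue,
			Message: "Cluster has deployed 4.0.0",
		},
		{
			Type:    configv1.OperatorProgressing,
			Status:  configv1.ConditionTrue,
			Message: "Unable to apply 4.0.1: a required object is missing",
		},
	}
}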
Actually, maybe that !progressing is still right, because the bug was really that we weren't clearing the failing condition on a successful sync.
This patch accounts for that case, though, and the changed test corrects my bad assumption (all the other tests still pass with this):
diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index b71370e..3ce2262 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -29,11 +29,12 @@ func (optr *Operator) syncAvailableStatus() error {
optrVersion, _ := optr.vStore.Get("operator")
failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
+ progressing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing)
message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
available := configv1.ConditionTrue
- if failing {
+ if (failing && !progressing) || (failing && optr.inClusterBringup) {
available = configv1.ConditionFalse
message = fmt.Sprintf("Cluster not available for %s", optrVersion)
}
diff --git a/pkg/operator/status_test.go b/pkg/operator/status_test.go
index 1437769..9b24c97 100644
--- a/pkg/operator/status_test.go
+++ b/pkg/operator/status_test.go
@@ -350,8 +350,7 @@ func TestOperatorSyncStatus(t *testing.T) {
},
},
},
- // 3. test that if progressing fails, we report available=false because state of the operator
- // might have changed in the various sync calls
+ // 3. test that if progressing fails, we report available=true for the current version
{
syncs: []syncCase{
{
@@ -390,7 +389,7 @@ func TestOperatorSyncStatus(t *testing.T) {
},
{
Type: configv1.OperatorAvailable,
- Status: configv1.ConditionFalse,
+ Status: configv1.ConditionTrue,
},
{
Type: configv1.OperatorFailing,
@@ -405,6 +404,29 @@ func TestOperatorSyncStatus(t *testing.T) {
},
},
},
+ {
+ // we mock the fact that we are at operator=test-version-2 after the previous sync
+ cond: []configv1.ClusterOperatorStatusCondition{
+ {
+ Type: configv1.OperatorProgressing,
+ Status: configv1.ConditionFalse,
+ },
+ {
+ Type: configv1.OperatorAvailable,
+ Status: configv1.ConditionTrue,
+ },
+ {
+ Type: configv1.OperatorFailing,
+ Status: configv1.ConditionFalse,
+ },
+ },
+ syncFuncs: []syncFunc{
+ {
+ name: "fn1",
+ fn: func(config renderConfig) error { return nil },
+ },
+ },
+ },
},
},
// 4. test that if progressing fails during bringup, we still report failing and not available
@@ -601,4 +623,4 @@ func TestInClusterBringUpStayOnErr(t *testing.T) {
assert.Nil(t, err, "expected syncAll to pass")
assert.False(t, optr.inClusterBringup)
-}
\ No newline at end of file
+}
The patch above effectively enables the MCO to report available=true, progressing=true, failing=true if we get a failure during a progressing while the MCO is still available (assumption from the CVO doc).
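Restated as a standalone helper (my wrapper, not the PR's actual code; the condition itself mirrors the diff above):

package operator

import configv1 "github.com/openshift/api/config/v1"

// availableCondition restates the rule from the patch: failing outside of a
// progressing phase, or failing while the cluster is still coming up, means
// not available; failing mid-upgrade keeps Available=true for the version
// that was already deployed.
func availableCondition(failing, progressing, inClusterBringup bool) configv1.ConditionStatus {
	if (failing && !progressing) || (failing && inClusterBringup) {
		return configv1.ConditionFalse
	}
	return configv1.ConditionTrue
}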
opened #450 to further discuss this.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cgwalters, runcom
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
HAProxy flake failing + ClusterIngress flake already marked. /retest
Unit flake #417. /retest
e2e-aws HAProxy flake again. /retest
- What I did
Fix a bug reported in #406 (comment)
where we weren't properly resetting the failing status.
Along with that, the second commit slightly refactors the operator to allow proper unit regression testing.
Added regression tests for the reported bug, as well as tests for other conditions.
- How to verify it
Run the unit tests, plus a custom payload on a cluster: somehow make something fail and watch the failing status clear.
- Description for the changelog
@abhinavdahiya could you please review this?